CPU load number database

I wrote some code to measure CPU load for every module under various conditions. I have the raw PRELIMINARY results here.

Goals

The goal is to create new features on the MM hardware and in VCV Rack that leverage pre-recorded CPU load data to make a reasonable estimate of the CPU usage of a patch. For example, the MM firmware could tell you in real-time how much the CPU usage would change if you added a certain module. Or you could browse modules and choose the lowest CPU utilizer that performs the function you need. Or, when using VCV Rack, it could display the estimated CPU usage if you were to run the patch on MM.

Test Data

I offer this up for comments, suggestions, and review. This is not the be-all and end-all source of CPU load numbers; it is just part of the CPU load picture. I will continue working on this, and I will upload new improved data and/or new test conditions without warning! So, if you can handle some work-in-progress data, please read on:


Updated Dec 4, 2024:
CPU load spreadsheet (params at 0.25, single core test, firmware v1.5.0, latest plugin versions <= 1.4.2):


In the spreadsheet you will see five columns (one for each block size) for each of the four tests:

  • Isolated: no jacks patched, all params (knobs, switches, etc) set to 0.25.
  • InputsZero: all input and output jacks patched, with 0V fed into all inputs. All params set to 0.25.
  • InputsLFOs: all inputs and outputs patched, with different frequency+phase triangle waves fed into each input (2Hz to 10Hz). All params set to 0.25.
  • InputsAudio: Same as InputsLFOs, except using audio-rate triangle waves (400Hz to 6kHz).

Conditions of the tests

I tested all built-in and plug-in modules at each block size (32, 64, 128, 256, 512). For each module, the test procedure automatically created a patch containing just the module-under-test. To handle Bogaudio modules, each module was first told to process one audio frame, and the timing of that frame was thrown out. Then the module was "run" 2048 times, and the time it took to process each audio frame was recorded. The worst-case time to process each block was kept, and this number was divided by the block size to give a worst-case average time to process one sample. Divide this by the amount of time we have per sample (20.8333us at 48kHz) and you get the CPU load for the module under the test conditions.
In between each time the module processes a sample, the code set and/or modulated the module's params and inputs, depending on which test was being run (see the descriptions of each test above).

My reasoning for using 0.25 for the param values is that some buttons act as if you are long-holding them if we set them to 0.5, and things like sliders might disable a channel if set to 0.

Interpreting the results

The number in each cell is the percent load for ONE CPU CORE at 48kHz.
There are two CPU cores. Each core runs in parallel, so you basically get to add up to 100% twice before the MM reports "CPU usage > 99%".

It might be tempting to mentally divide the load percentage in half… but I think that's a bit misleading. For example, if you have a module with 60% load, you could run two of them (60% on each core). But adding a third would mean one core is at 120%, and that's not allowed (whereas dividing by 2 would mean you think of 60% as 30%, which implies you could run 3 × 30% = 90%).
Both cores have to finish before the audio processing is done. So if one core is at 50% and the other core is at 75%, it's no different than if both cores were at 75%.
There is currently no load-balancer, so it's possible to have a patch where one core is at 90% and the other is at 5%.
Beyond the module usage, each cable between modules adds a tiny bit of load. Also each knob mapping in the active knob set adds a tiny bit of load.

Obviously none of these tests are going to represent typical usage (that is to say, in most patches we don't have ALL the jacks on ALL the modules being audio-rate modulated, nor do we have NONE of the jacks patched), but it's probably safe to say that whatever your patch looks like, it's somewhere between the two extremes.

Predicting CPU usage is extremely complex, as all the modules, the two CPU cores, the various levels of memory cache, and the CPU buses all interact to produce one single percentage. As much as it would be awesome to say "I'm using modules A, B, and C, so I'll just add up three numbers and that will be my exact CPU load", it's not going to be that easy. This is modular: we like to make complex webs of patch cables and experiment with unorthodox signal flows, and it's not a simple thing to predict what will happen. Which is part of the fun, too.

Possible improvements to the tests

More Tests:

  • Another test we could do is modulate all the params and see how that affects CPU load.
  • Another set of tests could use white noise fed into all the jacks.
  • Another set of tests could modulate a single input, or a single knob, at a time.

Running the tests differently:

  • The tests could be run multiple times, in random orders, to weed out bus or cache delays.
  • A patch with lots of modules could be automatically created and run, but only the time taken by the module-under-test would be recorded. This would more realistically simulate how the cache is loaded in real-world patches.
  • The way I do it now, I'm running the module 2048 times for each block size. A better way would be to run it 2048 times, group the results into block sizes, then repeat that 5 times and report the worst case.
  • Integrate the tests into a HIL (hardware-in-the-loop) test that runs automatically for each firmware and plugin release, updating the database and flagging any drastic changes in load.

Excellent work here - much appreciated!

Thank you, Dan. Incredibly helpful information.

This is very useful. Thank you for the groundbreaking work you are doing here, Dan - a one-man revolution!

brilliant, ty! insanely useful info @danngreen !

Good idea.

I can already see there's some performance issue with the kocmoc LADR filter. It should actually be the fastest filter module from kocmoc, since it's just doing explicit integration in default mode.

Yeah, there are some surprising numbers here!

I would recommend testing any surprising numbers against a real-world patch containing just the module in question (which I just did with the LADR, and it's about right if you consider a 7% "base load"). Using the LADDER_EULER_FULL_TANH method is a lot less load (around 17%).

It could be just the compiler struggling to optimize a tight inner loop out of the inlined tanh nightmare; the Fundamental VCF is showing same-ish numbers on that spreadsheet, and it's a similar non-linear integrator.

GCC 14.x seems to optimize it very well, on ARM and on x86-64.

Some interesting numbers there. I made an interesting patch with the Befaco Noise Plethora, so I was curious to see the numbers on that. The first construction got the 99% loaded message; after cutting down, it managed to load with an 80%(ish) CPU load, but as soon as I moved one of the linked knobs - 99% message. Cut down again, used my external 4ms EncVCA and one Osc instead of two; the patch now runs at about 56% CPU load - seems OK.

I can try profiling it with gcc 13.3. Unfortunately, gcc 14.x is not available for the 32-bit Cortex-A7 yet. Here are the actual flags and compiler used:

(And a float version, which looks equally hairy. And the imprecision might build up a lot, so I'm not sure if that's worth it.)

Eliminating branching in the inline tanh approximant seems to let the compiler do its thing both on 12.x and 14.x equally well:

The approximant remains mathematically bounded from above and below, so it shouldn't impact stability, and quick testing seems to confirm that.

Very nice! Load is down to 20% with the branchless TanhPade32


Watch out, though: the approximant in TanhPade32 is not a bounded function without the clamping, and you could get instabilities.

You could try (3*x)/(x^2 + 3) instead (this is a 2/3 order approximant instead of 3/2) if you want to keep it low order.

@danngreen this is great. It would be great to add a "heat map" to the Excel version. You can just ask Excel to color-code each cell from red to green depending on CPU utilization. That would give us the ability to quickly see the expensive ones.

+1 to auto color coding! i went ahead and did it manually for everything less than 14%. it's been a massive help to discover/integrate new modules i didn't know about before. but it def took quite a while to color it manually.


I'm no Excel expert, but a few clicks with conditional formatting gives a fair result without much effort.


I'm not too surprised about the CPU use for a bunch of the more complex modules - reverbs, filters, Noise Plethora, etc. - although there do seem to be some outliers. Befaco Spring Reverb is really out there.

I'm pretty surprised at how much CPU some of the basic utilities use, things like VCAs, mixers & LFOs. E.g., the Fundamental VCA is ~10% CPU on its own. I wonder if some of these might be candidates for optimization? Maybe even special versions, e.g. a Bogaudio Four-FO that updates at a slower rate?

It'd be nice if there were options for basic utilities that use a minimal amount of CPU, to leave more room for fun & experimentation.


I'm not sure if it will always correspond, but it turns out the CPU use on my clunky old laptop reports about the same percentage as on the module.
That's handy!


Yes, absolutely, we can optimize existing modules, starting with the most unique modules with heavy CPU usage.

We also can make new modules where we see gaps: simple stuff like VCAs are easy. Some of the 4ms "generic" modules are there just to fill in those gaps. Dual or triple/quad channel modules are also more efficient (even better is to offer the module in single, dual, quad, etc. forms).

Another tip for patching is to try to use multi-channel modules. For instance, if you need 4 VCAs in a patch, then 4 x Fundamental VCAs will net you 28%. But one Befaco HexMix or one 4ms VCAMatrix will be less than that.

Keep in mind these numbers represent extreme cases, too! So using all 6 channels of the HexMix is around 25% but using 4 channels will bring that down towards the first column of numbers (towards 8%).


also noticing v similar readouts between mm and this cheap ($80 ebay) asus vivobook (e210ma, w 2-core n4020) - setting cpu to "0%" in the windows 11 power profile does the trick.

awesome, hopefully we can see something like clouds' spectral mode optimized! as it is, it tanks the mm.

and looking at the 'blank panel' one in the cpu sheet at 4% across the board, it's safe to assume that all modules, no matter how lean/mean their code, will automatically use 4% off the bat?

as we see more and more modules optimized specifically for mm, hopefully the 32-module limit can be raised. i find i'm hitting it in a few of my patches with cpu to spare.

thanks for the tip! i'm mostly a luddite but hopefully i can figure something like this one out in google sheets