CPU load number database

I wrote some code to measure CPU load for every module under various conditions. I have the raw PRELIMINARY results here.

Goals

The goal is to create new features on the MM hardware and in VCV Rack that leverage pre-recorded CPU load data to make a reasonable estimate of the CPU usage of a patch. For example, the MM firmware could tell you in real-time how much the CPU usage would change if you added a certain module. Or you could browse modules and choose the lowest CPU utilizer that performs the function you need. Or, when using VCV Rack, it could display the estimated CPU usage if you were to run the patch on MM.

Test Data

I offer this up for comments, suggestions, and review. This is not the be-all and end-all source of CPU load numbers; it is just part of the CPU load picture. I will continue working on this, and I will upload new improved data and/or new test conditions without warning! So, if you can handle some work-in-progress data, please read on:


Updated Dec 4, 2024:
CPU load spreadsheet (params at 0.25, single core test, firmware v1.5.0, latest plugin versions <= 1.4.2):


In the spreadsheet you will see five columns (one for each block size) for each of the four tests:

  • Isolated: no jacks patched, all params (knobs, switches, etc) set to 0.25.
  • InputsZero: all input and output jacks patched, with 0V fed into all inputs. All params set to 0.25.
  • InputsLFOs: all inputs and outputs patched, with different frequency+phase triangle waves fed into each input (2Hz to 10Hz). All params set to 0.25.
  • InputsAudio: Same as InputsLFOs, except using audio-rate triangle waves (400Hz to 6kHz).

Conditions of the tests

I tested all built-in and plug-in modules at each block size (32, 64, 128, 256, 512). For each module, the test procedure automatically created a patch containing just the module-under-test. To handle Bogaudio modules, each module was first told to process one audio frame, and the timing of that frame was thrown out. Then the module was "run" 2048 times, and the time it took to process each audio frame was recorded. The worst-case time to process each block was kept, and this number was divided by the block size to give a worst-case average time to process one sample. Divide this by the amount of time we have per sample (20.8333us at 48kHz) and you get the CPU load for the module under the test conditions.
In between each time the module processes a sample, the code set and/or modulated the module's params and inputs, depending on which test was being run (see the descriptions of each test above).

My reasoning for using 0.25 for the param values is that some buttons act as if you are long-holding them if we set them to 0.5, and things like sliders might disable a channel if set to 0.

Interpreting the results

The number in each cell is the percent load for ONE CPU CORE at 48kHz.
There are two CPU cores. Each core runs in parallel, so you basically get to add up to 100% twice before the MM reports "CPU usage > 99%".

It might be tempting to mentally divide the load percentage in half… but I think that's a bit misleading. For example, if you have a module with 60% load, you could run two of them (60% on each core). But adding a third would mean one core is at 120%, and that's not allowed (whereas dividing by 2 would mean you think of 60% as 30%, which implies you could run 3 × 30% = 90%).
Both cores have to finish before the audio processing is done. So if one core is at 50% and the other core is at 75%, it's no different than if both cores were at 75%.
There is currently no load-balancer, so it's possible to have a patch where one core is at 90% and the other is at 5%.
Beyond the module usage, each cable between modules adds a tiny bit of load. Also each knob mapping in the active knob set adds a tiny bit of load.

Obviously none of these tests are going to represent typical usage (that is to say, in most patches we don't have ALL the jacks on ALL the modules being audio-rate modulated, nor do we have NONE of the jacks patched), but it's probably safe to say that whatever your patch looks like, it's somewhere between the two extremes.

Predicting CPU usage is extremely complex, as all the modules, the two CPU cores, the various levels of memory cache, and the CPU buses all interact to produce one single percentage. As much as it would be awesome to say "I'm using modules A, B, and C, so I'll just add up three numbers and that will be my exact CPU load", it's not going to be that easy. This is modular: we like to make complex webs of patch cables and experiment with unorthodox signal flows, and it's not a simple thing to predict what will happen. Which is part of the fun, too.

Possible improvements to the tests

More Tests:

  • Another test we could do is modulate all the params and see how that affects CPU load.
  • Another set of tests could use white noise fed into all the jacks.
  • Another set of tests could modulate a single input, or a single knob, at a time.

Running the tests differently:

  • The tests could be run multiple times, in random orders, to weed out bus or cache delays.
  • A patch with lots of modules could be automatically created and run, but only the time taken by the module-under-test would be recorded. This would more realistically simulate how the cache is loaded in real-world patches.
  • The way I do it now, I'm running the module 2048 times for each block size. A better way would be to run it 2048 times, group the results into block sizes, then repeat that 5 times and report the worst case.
  • Integrate the tests into a HIL (hardware-in-the-loop) test that runs automatically for each firmware and plugin release, updating the database and flagging any drastic changes in load.

Excellent work here - much appreciated!

Thank you, Dan. Incredibly helpful information.

This is very useful. Thank you for the groundbreaking work you are doing here, Dan - a one-man revolution!

brilliant, ty! insanely useful info @danngreen !

Good idea.

I can already see there's some performance issue with the kocmoc LADR filter. It should actually be the fastest filter module from kocmoc, since it's just doing explicit integration in default mode.

Yeah, there are some surprising numbers here!

I would recommend testing any surprising numbers against a real-world patch containing just the module in question (which I just did with the LADR, and it's about right if you consider a 7% "base load"). Using the LADDER_EULER_FULL_TANH method is a lot less load (around 17%).

It could be just the compiler struggling to optimize a tight inner loop out of the inlined tanh nightmare; the Fundamental VCF is showing same-ish numbers on that spreadsheet, and it's a similar non-linear integrator.

GCC 14.x seems to optimize it very well, on ARM and on x86-64.

Some interesting numbers there. I made an interesting patch with the Befaco Noise Plethora, so I was curious to see the numbers on that. The first construction got the 99% loaded message; after cutting down, it managed to load with an 80%(ish) CPU load, but as soon as I moved one of the linked knobs - 99% message. Cut down again, used my external 4ms EncVCA and one Osc instead of two; the patch now runs at about 56% CPU load - seems OK.

I can try profiling it with gcc 13.3. Unfortunately, gcc 14.x is not available for the 32-bit Cortex-A7 yet. Here are the actual flags and compiler used:

(And a float version, which looks equally hairy. And the imprecision might build up a lot, so I'm not sure if that's worth it.)

Eliminating branching in the inline tanh approximant seems to let the compiler do its thing both on 12.x and 14.x equally well:

The approximant remains mathematically bounded from above and below, so it shouldn't impact stability, and quick testing seems to confirm that.

Very nice! Load is down to 20% with the branchless TanhPade32


Watch out, though: the approximant in TanhPade32 is not a bounded function without the clamping, and you could get instabilities.

You could try (3*x)/(x^2 + 3) instead (this is a 2/3 order approximant instead of 3/2) if you want to keep it low order.

@danngreen this is great. It would be great to add a "heat map" to the Excel version. You can just ask Excel to color-code each cell from red to green depending on CPU utilization. That would give us the ability to quickly see the expensive ones.

+1 to auto color coding! i went ahead and did it manually for everything less than 14%. it's been a massive help to discover/integrate new modules i didn't know about before. but it def took quite a while to color it manually.


I'm no Excel expert, but a few clicks with conditional formatting gives a fair result without much effort.


I'm not too surprised about the CPU use for a bunch of the more complex modules - reverbs, filters, Noise Plethora, etc. - although there do seem to be some outliers. Befaco Spring Reverb is really out there.

I'm pretty surprised at how much CPU some of the basic utilities use, things like VCAs, mixers & LFOs. E.g., the Fundamental VCA is ~10% CPU on its own. I wonder if some of these might be candidates for optimization? Maybe even special versions, e.g. a Bogaudio Four-FO that updates at a slower rate?

It'd be nice if there were options for basic utilities that use a minimal amount of CPU, to leave more room for fun & experimentation.


I'm not sure if it will always correspond, but it turns out the CPU use on my clunky old laptop reports about the same percentage as on the module.
That's handy!


Yes, absolutely, we can optimize existing modules, starting with the most unique modules with heavy CPU usage.

We also can make new modules where we see gaps: simple stuff like VCAs are easy. Some of the 4ms "generic" modules are there just to fill in those gaps. Dual or triple/quad channel modules are also more efficient (even better is to offer the module in single, dual, quad, etc. forms).

Another tip for patching is to try to use multi-channel modules. For instance, if you need 4 VCAs in a patch, then 4 x Fundamental VCAs will net you 28%. But one Befaco HexMix or one 4ms VCAMatrix will be less than that.

Keep in mind these numbers represent extreme cases, too! So using all 6 channels of the HexMix is around 25% but using 4 channels will bring that down towards the first column of numbers (towards 8%).


also noticing v similar readouts between mm and this cheap ($80 ebay) asus vivobook (e210ma, w 2-core n4020) - setting cpu to "0%" in the windows 11 power profile does the trick.

awesome, hopefully we can see something like clouds' spectral mode optimized! as it is, it tanks the mm.

and looking at the 'blank panel' one in the cpu sheet at 4% across the board, it's safe to assume that all modules, no matter how lean/mean their code, will automatically use 4% off the bat?

as we see more and more modules optimized specifically for mm, hopefully the 32-module limit can be raised. i find i'm hitting it in a few of my patches with cpu to spare.

thanks for the tip! i'm mostly a luddite but hopefully i can figure something like this one out in google sheets