SIMD on the MM is done with NEON.
There’s a lot of info about it; ARM’s introduction is a good starting place: Documentation – Arm Developer
The simde library supports NEON, so your code may already be generating NEON SIMD instructions. The best way to know for sure is to inspect the assembly and look for SIMD instructions.
Harnessing SIMD is a complex topic. You have to take a data-oriented approach, that is, the layout of your data structures is critical to whether the compiler will be able to use SIMD or not. To add two groups of four floats, for example, all the floats need to be stored at a particular alignment, and the groups of four must be consecutive in memory, in the correct order. The float_4 type will take care of keeping the floats together in memory, but the alignment is harder to guarantee (the compiler flags for alignment are easy to get wrong and are sometimes silently ignored).
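To make that concrete, here’s a minimal sketch of the data-oriented layout, assuming VCV Rack’s rack::simd::float_4 (whose SSE intrinsics are what simde maps to NEON); the struct and member names are just placeholders:

    #include <rack.hpp>
    using rack::simd::float_4;

    struct Voices {
        // Keeping each per-voice quantity in a float_4 means the four floats
        // stay adjacent in memory, which is what the SIMD loads/stores want.
        float_4 phase;
        float_4 freq;

        void step(float sampleTime) {
            // One vector multiply-add advances all four voices at once.
            phase += freq * sampleTime;
            phase -= rack::simd::floor(phase);
        }
    };

Whether this actually becomes NEON on the MM depends on the flags and on the simde translation, which is exactly why the disassembly check below is worth doing.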
To check the assembly, I put in a cmake target to generate the disassembly. It takes a while to run and makes a file that might be several hundred megabytes, so be patient. If you want the file to be smaller (recommended!), modify your cmake file to only build the module in question.
Do this from the plugin’s project root:
cmake --build build -- debugdiss
In build/ you’ll see a file ending in *-debug.so.diss. This can be opened in a text editor. Also related are *-debug.so.readelf and *-debug.so.nm, which contain the output of readelf and nm. Handy for figuring out why a particular symbol is missing (or is included) and for checking the memory layout of the plugin.
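For reference, those three files are most likely just captured tool output; you can get much the same thing by hand with the cross toolchain’s binutils (the arm-none-eabi- prefix and the plugin filename here are assumptions, adjust to your toolchain and your actual .so):

    arm-none-eabi-objdump -S -C build/MyPlugin-debug.so > MyPlugin-debug.so.diss
    arm-none-eabi-readelf -a build/MyPlugin-debug.so > MyPlugin-debug.so.readelf
    arm-none-eabi-nm -C build/MyPlugin-debug.so > MyPlugin-debug.so.nm

objdump’s -S interleaves the original source with the disassembly, which is what produces the mixed C++/assembly listing shown further down.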
Open it up and look for NEON instructions (the ARM site has the complete list). The quad instructions use the qX registers. If you see just sX registers, that’s just normal floating point (single-precision floats). If you see dX registers, that could be due to using doubles, or it could be due to dual-lane SIMD instructions.
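If you don’t fancy scrolling through a few hundred megabytes by eye, a rough first pass is to grep for q-register instructions (the filename is a placeholder):

    grep -cE 'q[0-9]+' build/MyPlugin-debug.so.diss
    grep -nE 'v(mul|add|mla)\.f32[[:space:]]+q' build/MyPlugin-debug.so.diss | less

The first is just a count of lines that mention a q register anywhere; the second looks for a few common 4-wide float math mnemonics, so treat it as a starting point rather than a complete audit.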
For example, I just disassembled Venom, scrolled through, and saw this:
vdupq_n_f32 (float32_t __a)
{
return (float32x4_t) {__a, __a, __a, __a};
37764: f3b40c40 vdup.32 q0, d0[0]
37768: f3f42c47 vdup.32 q9, d7[0]
this->b[0] = K * K * norm;
3776c: f4040adf vst1.64 {d0-d1}, [r4 :64]
37770: f3fc0c47 vdup.32 q8, d7[1]
return (float32x4_t) __builtin_neon_vmulfv4sf (__a, __b);
37774: f3444dd0 vmul.f32 q10, q10, q0
You don’t have to understand the assembly to know that this function is using SIMD. There are lots of mentions of the q registers, including the last line, which operates exclusively on q registers.
Note that sometimes you will see q or d registers used and it has nothing to do with SIMD or even math. The compiler will sometimes clear or copy a block of memory using the big SIMD registers; I guess it’s more efficient that way. But if you see something like vmul.f32 q10, q10, q0 (which is a multiply, according to the NEON guide), then that’s doing math on four floats at the same time: here it’s multiplying the four floats in q0 by the four floats in q10 and storing the result in q10.
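If you want to see the intrinsic-to-instruction correspondence in isolation, here’s a tiny sketch using the NEON intrinsics from arm_neon.h (the function name is made up; built for ARMv7 with NEON enabled, e.g. -mfpu=neon, it should come out as a single vmul.f32 on q registers):

    #include <arm_neon.h>

    // Multiply four packed floats by four packed floats.
    float32x4_t mul4(float32x4_t a, float32x4_t b) {
        return vmulq_f32(a, b);  // typically compiles to: vmul.f32 qX, qY, qZ
    }

That’s essentially what simde is mapping float_4’s SSE intrinsics onto when everything lines up, and checking the .diss file is how you confirm it actually happened.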