SIMD on the MM is done with NEON.
There’s a lot of info about it; ARM’s introduction is a good starting place: Documentation – Arm Developer
The simde library supports NEON, so your code may already be generating NEON SIMD instructions. The best way to know for sure is to inspect the assembly and look for SIMD instructions.
Harnessing SIMD is a complex topic. You have to take a data-oriented approach, that is, the layout of your data structures is critical to whether the compiler will be able to use SIMD or not. To add two groups of four floats, for example, all the floats need to be stored at a particular alignment, and the groups of four must be consecutive in memory, in the correct order. The float_4 type will take care of keeping the floats together in memory, but the alignment is harder to guarantee (the compiler flags for alignment are easy to get wrong and are sometimes silently ignored).
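To make that concrete, here’s a minimal sketch of the data-oriented layout, assuming VCV Rack’s rack::simd::float_4 (whose SSE intrinsics are what simde maps to NEON); the struct and member names are just placeholders:

    #include <rack.hpp>
    using rack::simd::float_4;

    struct Voices {
        // Keeping each per-voice quantity in a float_4 means the four floats
        // stay adjacent in memory, which is what the SIMD loads/stores want.
        float_4 phase;
        float_4 freq;

        void step(float sampleTime) {
            // One vector multiply-add advances all four voices at once.
            phase += freq * sampleTime;
            phase -= rack::simd::floor(phase);
        }
    };

Whether this actually becomes NEON on the MM depends on the flags and on the simde translation, which is exactly why the disassembly check below is worth doing.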
To check the assembly, I put in a cmake target to generate the disassembly. It takes a while to run and makes a file that might be several hundred megabytes, so be patient. If you want the file to be smaller (recommended!), modify your cmake file to only build the module in question.
Do this from the plugin’s project root:
cmake --build build -- debugdiss
In build/ you’ll see a file ending in *-debug.so.diss. This can be opened in a text editor. Also related are *-debug.so.readelf and *-debug.so.nm, which contain the output of readelf and nm. Handy for figuring out why a particular symbol is missing (or is included) and for checking the memory layout of the plugin.
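For reference, those three files are most likely just captured tool output; you can get much the same thing by hand with the cross toolchain’s binutils (the arm-none-eabi- prefix and the plugin filename here are assumptions, adjust to your toolchain and your actual .so):

    arm-none-eabi-objdump -S -C build/MyPlugin-debug.so > MyPlugin-debug.so.diss
    arm-none-eabi-readelf -a build/MyPlugin-debug.so > MyPlugin-debug.so.readelf
    arm-none-eabi-nm -C build/MyPlugin-debug.so > MyPlugin-debug.so.nm

objdump’s -S interleaves the original source with the disassembly, which is what produces the mixed C++/assembly listing shown further down.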
Open it up and look for NEON instructions (the ARM site has the complete list). The quad instructions use the qX registers. If you see just sX registers, that’s just normal floating point (single-precision floats). If you see dX registers, that could be due to using doubles, or it could be due to dual-lane SIMD instructions.
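If you don’t fancy scrolling through a few hundred megabytes by eye, a rough first pass is to grep for q-register instructions (the filename is a placeholder):

    grep -cE 'q[0-9]+' build/MyPlugin-debug.so.diss
    grep -nE 'v(mul|add|mla)\.f32[[:space:]]+q' build/MyPlugin-debug.so.diss | less

The first is just a count of lines that mention a q register anywhere; the second looks for a few common 4-wide float math mnemonics, so treat it as a starting point rather than a complete audit.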
For example, I just disassembled Venom, scrolled through, and saw this:
vdupq_n_f32 (float32_t __a)
{
return (float32x4_t) {__a, __a, __a, __a};
37764: f3b40c40 vdup.32 q0, d0[0]
37768: f3f42c47 vdup.32 q9, d7[0]
this->b[0] = K * K * norm;
3776c: f4040adf vst1.64 {d0-d1}, [r4 :64]
37770: f3fc0c47 vdup.32 q8, d7[1]
return (float32x4_t) __builtin_neon_vmulfv4sf (__a, __b);
37774: f3444dd0 vmul.f32 q10, q10, q0
You don’t have to understand the assembly to know that this function is using SIMD. There are lots of mentions of the q registers, including the last line, which operates exclusively on q registers.
Note that sometimes you will see q or d registers used and it has nothing to do with SIMD or even math. The compiler will sometimes clear or copy a block of memory using the big SIMD registers; I guess it’s more efficient that way. But if you see something like vmul.f32 q10, q10, q0 (which is a multiply, according to the NEON guide), then that’s doing math on four floats at the same time: here it’s multiplying the four floats in q0 by the four floats in q10 and storing the result in q10.
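If you want to see the intrinsic-to-instruction correspondence in isolation, here’s a tiny sketch using the NEON intrinsics from arm_neon.h (the function name is made up; built for ARMv7 with NEON enabled, e.g. -mfpu=neon, it should come out as a single vmul.f32 on q registers):

    #include <arm_neon.h>

    // Multiply four packed floats by four packed floats.
    float32x4_t mul4(float32x4_t a, float32x4_t b) {
        return vmulq_f32(a, b);  // typically compiles to: vmul.f32 qX, qY, qZ
    }

That’s essentially what simde is mapping float_4’s SSE intrinsics onto when everything lines up, and checking the .diss file is how you confirm it actually happened.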