0

In normal C++ we can dynamically allocate an array of floats using either the standard library function malloc or the new keyword. SIMD vector types come as compiler extensions, like float32x4_t (for ARM NEON). Is it safe to dynamically allocate an array of such SIMD vectors like this:

uint32_t number_req = 32; 
float32x4_t *simd_arr = (float32x4_t *)malloc(sizeof(float32x4_t) * number_req); 

I'm trying to limit the number of load/store instructions in my code. If the above is not a legitimate method, then what is the proper way to implement it? Any help will be greatly appreciated! Thank you very much in advance!

Vivekanand V
  • 340
  • 2
  • 12
  • 1
    You can do so (with the caveat that it must be properly aligned, but I won't focus on that). However if your array is really big it won't fit into registers and you won't see any performance gain. But this would also be true of a very large array whose size is statically known. When you write a function which operates on such an array, you should use prefetch and examine the generated assembly (and profile) to determine if your code is doing what you expect. – user2407038 Jun 21 '21 at 02:28
  • Generally, it will be safe to dynamically allocate (arrays of) types like `float32x4_t` as long as you don't fiddle with alignment. In C++, it is usually preferable to use a `new` expression rather than `malloc()` unless relevant documentation (e.g. for your compiler, or - in your case - the ARM development guide) says otherwise. Whether that affects load and store instructions or not is something you'll need to check in another way (e.g. by examining assembler output by compiler). – Peter Jun 21 '21 at 02:55

3 Answers

2

I'm trying to limit the amount of load store instructions in my code.

Reducing the number of load/store intrinsics in your code this way won't help with that.

Dereferencing a float32x4_t* is exactly equivalent to a load or store intrinsic, and in fact probably how the 1-vector aligned-load intrinsic is implemented.

It's up to the compiler when it can keep a vector type in a vector register, just like for keeping an int object in a normal integer register.

Load/store intrinsics mostly exist for communicating alignment to the compiler, and keeping it happy about types; look at the compiler-generated asm to see what's really happening.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Thank you for your reply! Presently, I have a normal dynamically allocated `float` array, and to use the intrinsic functions I have to call `vld1q_f32(...)` (load a float32x4_t from a float array, for ARM NEON) every time I do an operation, say, a vector dot product, and then I have to store the result using `vst1q_f32()` to the result array (a traditional C++ `float` array). Is there any workaround to limit this incessant calling of load/store functions!? – Vivekanand V Jun 21 '21 at 03:18
  • 1
    @VivekanandV: The only viable solution on current architectures is to do more work in each pass over the array, while the data is already live in a register (local variable). e.g. if you wanted all combos of dot products between three vectors, you'd make one pass that loaded each element once, not 3 passes for 3 separate dot products. Or combine with the pass generating it. The amount of ALU work per load/store (or per time data is pulled into L1d cache) is called "computational intensity", and increasing it is key to performance on CPUs with powerful ALUs compared to their memory bandwidth. – Peter Cordes Jun 21 '21 at 03:41
  • 1
    (Not to mention the actual loads/stores taking pipeline slots). Basically, until we have computational memory or similar ALUs that scale with memory, the **[Von Neumann Bottleneck](https://wiki.c2.com/?VonNeumannBottleneck)** is not something you can avoid just by playing tricks with C types. Just like `arr[i] += 1;` will require the compiler to emit load and store instruction, so will doing the same thing using intrinsics. – Peter Cordes Jun 21 '21 at 03:44
  • @PeterCordes oof, didn't see ARM :/ my bad! Delete my comments as they are totally irrelevant to this Q/A: https://stackoverflow.com/questions/52147378/choice-between-aligned-vs-unaligned-x86-simd-instructions – Noah Jun 21 '21 at 15:39
  • @PeterCordes put it [here](https://stackoverflow.com/questions/52147378/choice-between-aligned-vs-unaligned-x86-simd-instructions/52153970#comment120312347_52153970) – Noah Jun 21 '21 at 16:32
2

You probably want aligned_alloc, which was introduced in C11 as a replacement for malloc in cases like this.

MSalters
  • 173,980
  • 10
  • 155
  • 350
0

or memalign(), which is available in the common Linux libc implementations (posix_memalign() is the standardized POSIX interface for the same job).