Are these two for loops equivalent?

Question

I working with a convolution and, in particular, I'm trying to speedup its execution. To obtain this acceleration I'm using a SIMD instruction in order to perform two multiplication at the same time where the result of one is put in the 32 higher bit of a 64 bit variable while the other result is in 32 lower bit. The problem is that the new code seems not working as the old one.

The initial code contains this for-loop

 int32_t v32;
 int16_t arr_2[1024];
 int16_t data[96];
 int32_t accu;
  ...
        for(int j=0; j<INPUT_F; j++){
          v32 = arr_2[l*OUT_F+j]*data[k*K*INPUT_F+(l-i+K/2)*INPUT_F+j]
          accu += v32;
        }
  ...

the questions is: apart for the multiplication functions, are the other operations equivalent or am I doing something wrong ?

 uint64_t v64; 
 int16_t arr_2[1024];
 int16_t data[96];
 int32_t accu;
 ...
        for(int j=0; j<INPUT_F/2; j++){
          v64 = __mul(arr_2[l*OUT_F+2*j],data[k*K*INPUT_F+(l-i+K/2)*INPUT_F+2*j]); //use a simd instruction to perform mul between two consecutive values in the arrays.
          accu += ((int32_t)(v64 & 0xFFFFFFFF); //first value
          accu += ((int32_t)((v64 >> 32) & 0xFFFFFFFF); //second value
        }
 ...

__mul() is defined as uint64_t __mul(uint32_t a, uint32_t b); and even if the operands are uint32_t it takes into account the fact that there are two int16_t values internally.

Please start by de-obfuscating that moster expression by using temporary variables and splitting the code into multiple expressions. — Lundin, May 17 '23 at 12:13
You said "*Use a simd instruction to perform mul between two consecutive values in the two arrays.*" and "*__mul() is defined as `uint64_t __mul(uint32_t a, uint32_t b);`*". These statements are contradictory. — ikegami, May 17 '23 at 12:28
`__mul(array_2[l][2*j],weights[k][l-i+CONV_K/2][2*j])` cannot possibly multiply two pairs of `int16_t` values because, given the code shown, `array_2[l][2*j]` is a single `int16_t` value and `weights[k][l-i+CONV_K/2][2*j]` is a single `int16_t` value. If the macro took the addresses of these elements and used those addresses to access two elements at each location, then maybe it could be work, but the definition of the macro that you showed does not do that. — Eric Postpischil, May 17 '23 at 12:29
Well, I would have said that `__mul(array_2[l][2*j],weights[k][l-i+CONV_K/2][2*j])` cannot possibly multiple two pairs of `int16_t` values because it is not valid C, given `array_2` and `weights` being declared as (1D) arrays of `int16_t`. — John Bollinger, May 17 '23 at 12:33
@EricPostpischil I thought that having declared `a` and `b` as uint32_t when I pass the with that index it would take 32 consecutive bits (that's why I used 2*j) — Dresult, May 17 '23 at 12:39
We appreciate you trying to present a minimal example, but that's useful only if it is *representative* of the problem. If the code presented does not even compile, then I can be confident that you have not verified that. — John Bollinger, May 17 '23 at 12:40

Eric Postpischil · Accepted Answer · 2023-05-17T14:42:36.410

[From a comment] I thought that having declared a and b as uint32_t when I pass the with that index it would take 32 consecutive bits (that's why I used 2*j)

Functions do not “take” things from the environment where they are called.

When a parameter has type uint32_t, that means an argument passed for that parameter will be converted to the type uint32_t. It does not mean 32 bits will be pulled from wherever the argument comes from.

In C, expressions are formed from subexpressions and their operands, and each operand and subexpression is evaluated based on its type, not the type of the enclosing expression.

In __mul(array_2[l*OUT_FEA+2*j],weights[k*CONV_K*INPUT_FEA+(l-i+CONV_K/2)*INPUT_FEA+2*j]), array_2[l*OUT_FEA+2*j] has type in16_t because array_2 is declared an array of int16_t elements. So the index l*OUT_FEA+2*j is calculated and used to look up an element in the array. That single int16_t element is taken and is passed for the a parameter of __mul. Since that parameter has type uint32_t, the single int16_t value is converted to the type uint32_t.

Nothing in this code causes two elements of array_2 to be fetched or used.

These are fundamental aspects of C, and it is futile to attempt SIMD programming in C without understanding them.

To pass to __mul a uint32_t value that contains the bits of two int16_t elements, you must fetch two int16_t elements. There are multiple ways to do this in C. One would be to fetch two elements (by writing them as separate operands in an expression) and combine them using conversions and bit-shifting. However, when we are trying to accelerate performance using SIMD, we generally want to avoid separate fetches of separate elements. (Optimization by the compiler might combine separate fetches into a single fetch, but relying on this requires additional knowledge and considerations beyond the scope of this answer.)

To that end, it is common in SIMD code to access an array of int16_t elements using an lvalue of type uint32_t. However, this requires additional considerations of the rules of C, notably rules about aliasing types and about alignment. It is necessary to ensure that array_2 and weights are correctly aligned for the uint32_t type (or that we write code that adapts to whatever alignment they have) and that either we make arrangements to alias the array using the uint32_t in accordance with the rules of the C compiler or the compiler provides assurances beyond the C standard that it supports the aliasing.

Explaining these things goes beyond the scope of a simple Stack Overflow answer. They are prerequisites that should be learned when or before starting SIMD programming.

Most real-world SIMD intrinsic APIs have special SIMD types like x86 `__m128i`, and load functions like `_mm_loadu_si128( (__m128i*) &array_2[stuff] )` to load 16 bytes from that address, with a strict-aliasing-safe unaligned load. (See also [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605)). Similarly, ARM NEON intrinsics have types like `int16x4_t` (8-byte vector) and `int16x8_t` (16-byte vector). — Peter Cordes, May 17 '23 at 21:03
So intrinsic APIs give you tools to avoid needing `memcpy` for doing your own aliasing-safe unaligned loads, or GNU C `__attribute__((aligned(1),may_alias))` typedefs for loading a `uint32_t`. — Peter Cordes, May 17 '23 at 21:03

Are these two for loops equivalent?

1 Answers1