Wrong result in vectorization with SSE

Question

The code below generates the following output:

6 6 0 140021597270387

which means that only the first two positions are calculated correctly. However, I am dealing with longs (4 bytes) and __m128i can hold more than 4 longs.

long* AA = (long*)malloc(32*sizeof(long));
long* BB = (long*)malloc(32*sizeof(long));

for(i = 0; i<4;i++){
    AA[i] = 2;
    BB[i] = 3;
}

__m128i* m1 = (__m128i*) AA;
__m128i* m2 = (__m128i*) BB;

__m128i m3 = _mm_mul_epu32(m1[0],m2[0]);

long* CC = (long*) malloc(16 * sizeof(long));
CC = (long*)&m3;

for (i = 0; i < 4; i++)
    printf("%ld \n",CC[i]);

To allocate:

long* AA = (long*) memalign(16 * sizeof(long), 16);

(and the remaining vectors) generates a seg. fault. Can somebody comment?

Thanks

Why are you allocating `CC` and then immediately assigning the address of `m3` to it? — Jonathon Reinhart, Jan 28 '14 at 22:15
`__m128i` may or may not hold four longs. The size of `long` is implementation specific, and may be 32-bits or larger. On many 64-bit architectures, `long` is actually 64-bits, so `__m128i` may only hold two longs. You should check that `sizeof(__m128i) == 4*sizeof(long)`. — sfstewman, Jan 28 '14 at 22:23
Related: [SSE multiplication of four 32-bit integers](http://stackoverflow.com/q/10500766/183120) — legends2k, Jan 29 '14 at 09:44
On Windows, `long` is 32 bits, but most 64-bit Unix-like systems have a 64-bit `long`. — phuclv, Jan 29 '14 at 10:29

Paul R · Answer 1 · 2014-01-29T09:24:07.297

1) don't use an indeterminate-sized type like long, use a specific fixed-width type such as uint32_t

2) don't use malloc - it's not guaranteed to return 16 byte aligned memory, use memalign or equivalent*

3) don't cast the result of malloc (or any other function returning void *) in C

4) no need to allocate yet another buffer just to print results

Fixed code:

uint32_t* AA = memalign(16, 32*sizeof(uint32_t));   // memalign(alignment, size)
uint32_t* BB = memalign(16, 32*sizeof(uint32_t));

for (i = 0; i < 4; i++){
    AA[i] = 2;
    BB[i] = 3;
}

__m128i* m1 = (__m128i*)AA;
__m128i* m2 = (__m128i*)BB;

__m128i m3 = _mm_mul_epu32(m1[0], m2[0]);    // 2 x 32x32->64 bit unsigned multiplies -> m3

uint64_t* CC = (uint64_t*)&m3;

for (i = 0; i < 2; i++)                      // display 2 x 64 bit result values
    printf("%llu\n", (unsigned long long)CC[i]);   // cast: uint64_t is not unsigned long long on every platform

*Note that, depending on your platform, you may need to use a call other than memalign in order to allocate suitably aligned memory, e.g. posix_memalign, _mm_malloc or _aligned_malloc (WIN32).

edited Jan 29 '14 at 09:24

answered Jan 28 '14 at 22:31

Paul R


It's worth noting that `posix_memalign` isn't cross-platform (no Windows). It also isn't called this way: http://pubs.opengroup.org/onlinepubs/007904975/functions/posix_memalign.html – sfstewman Jan 28 '14 at 22:42
@PaulR: `_mm_mul_epu32` is documented as only two multiplications _a0 * b0_ and _a2 * b2_. Will a single call multiply 4 `int32`s? – legends2k Jan 29 '14 at 08:42
@legends2k: Indeed, `_mm_mul_epi32`/`_mm_mul_epu32` perform 2 x 32x32->64 bit int multiplies. If you want 4 x 32x32->32 bit multiplies then you can either use several 16 bit multiplies to put this together, or perhaps 2 x `_mm_mul_epu32` and some shifting/shuffling, or if accuracy is not crucial then convert to float, use `_mm_mul_ps`, and convert back to int. There is no single instruction for this. – Paul R Jan 29 '14 at 08:53
@PaulR: In the above example, you're packing four `int32_t`s in the registers and calling `_mm_mul_epu32` once whose result is two `int64_t` while you access each as an `int32` four times to print the output. Shouldn't you be calling `_mm_mul_epu32` twice? – legends2k Jan 29 '14 at 09:05
Yes, sorry - I was just fixing the fundamental problems in the OPs code (i.e. crashes) and not worrying too much about the finer detail. I'll tidy the display part up too. – Paul R Jan 29 '14 at 09:21
@PaulR: No problem; in fact, my queries were only after reading [your answer for a related question](http://stackoverflow.com/a/10501533/183120). – legends2k Jan 29 '14 at 09:45
@legends2k: good catch - I'd forgotten about `_mm_mullo_epi32(a, b)` in SSE4.1 - that might be useful for the OP. – Paul R Jan 29 '14 at 12:13
1) The reason I was using longs is because I need to use SSE instructions in my code, which is full of vectors of longs. Assuming that I want to do these operations with AA and BB and save the result in CC, what is the most correct cast? 2) How can I multiply four longs at the same time? Is it possible to use the full 128-bit width of the registers at once? The best possible speedup in that code is a factor of 2x, right? – a3mlord Jan 29 '14 at 15:31
Just realized that in my case sizeof(long) is 8, so I can't pack more than two longs in one __m128i register... – a3mlord Jan 29 '14 at 15:39
OK - you're not going to see much benefit with 64 bit data on SSE anyway, so you might as well abandon this idea, unless you're thinking of taking it to AVX eventually? – Paul R Jan 29 '14 at 16:42
That is fine - I do have an AVX-capable processor. Can you please point out some links where I can learn how to vectorize code with AVX instructions? Anyway, can you please answer my previous 2 questions? Thanks! – a3mlord Jan 29 '14 at 19:00
@a3mlord The correct cast is no cast: use unions. Casting in the way that you're doing it is called type punning which breaks aliasing rules and is technically forbidden by the C99/C11 standard. Many compilers support it, but then cannot emit optimal code. – sfstewman Jan 29 '14 at 19:19
@a3mlord: you would need AVX2 for integer work, so Haswell/Broadwell or later. All the docs are on http://developer.intel.com. Be sure to check out the Intel Intrinsics Guide, which is quite comprehensive. I think the docs cover all your questions, but if you have any specific questions remaining then start a new question here on StackOverflow (lengthy discussions in comments are discouraged). – Paul R Jan 29 '14 at 20:03
I'll use Ivy Bridge, so I guess that I have AVX only. Assuming that I continue with these SSE2 instructions, I guess that I'll have to use unions. However, I didn't find any good explanation of their use in vectorized code. The Intel Intrinsics Guide doesn't have examples either. Any site that you can recall off the top of your head? I am a software guy and I've been using vectorized functions via libraries only... – a3mlord Jan 29 '14 at 23:00
@a3mlord: You have SSE4.2 and AVX on Ivy Bridge. There are quite a few good questions and answers here on SO about using SSE and AVX (see e.g. http://stackoverflow.com/questions/10500766/sse-multiplication-of-4-32-bit-integers/10501533#10501533). Unions are useful for converting between scalar and vector data but you shouldn't need them too often - the intrinsics cover 99% of use cases. If you have specific questions then post them on SO with a `simd` and/or `sse` tag and either myself or Mysticial or one of the other SIMD people will probably give you advice. – Paul R Jan 29 '14 at 23:26
@Paul R., with "only" I meant to say that I don't have AVX2, only AVX. My biggest problem is definitely the cast between, say, a vector of longs and this __m128i data type. Is it shown in the link that you just posted? Thanks! – a3mlord Jan 30 '14 at 10:54
As I said previously, if you really need 64 bit integers, then SSE is not going to be much help - there are very few 64 bit arithmetic operations, and even where there are, you only get two operations per instruction. Given that there are two or more scalar ALUs on most modern CPUs there is really nothing to be gained from doing 64 bit integer SIMD at 128 bits anyway. AVX2 *might* be a different story, but you only have AVX currently, so that doesn't help either. – Paul R Jan 30 '14 at 11:14
I am sorry for pushing this further, but I am still a little bit confused. I have longs in my code, which are 64 bit integers. I can pack two of these in each SIMD register, and therefore expect a speedup of 2 times, right? Or are you saying that I will get nothing because this operation is already made in one clock cycle (using two scalar ALUs)? If so, can't I pack these in a SIMD register and use the scalar ALUs to other operations, such as, say, two more positions of my vectors and thereby expect a speedup of 2 times? I appreciate your patience. – a3mlord Jan 30 '14 at 16:08
That's correct, but more importantly see my other comment about lack of 64 bit arithmetic instructions in SSE. You can get two 64 bit longs into a vector, but you can't do much with the data once it's there, other than add/subtract. – Paul R Jan 30 '14 at 16:10
I only need to add, multiply and compare. Don't I have all of those? BTW, what would be the consequences of changing my code to use ints (32b) instead of longs (64b)? Could I then expect a speedup of 4x? – a3mlord Jan 30 '14 at 19:10
Nope - just add and subtract at 64 bits. If you really only need the range of 32 bit ints (in which case why are you using 64 bit longs anyway ???) then yes, you're back in business, and you can often realise 4x (or sometimes more) throughput improvement over scalar code if you know what you're doing. – Paul R Jan 30 '14 at 19:30
