4

I vectorized the following loop, which crops up in an application I am developing:

void vecScl(Node** A, Node* B, long val){

    int fact = round( dot / const1 );   // dot and const1 are defined elsewhere in the application

    for(int i=0; i<SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];

}

And this is the SSE code:

void vecSclSSE(Node** A, Node* B, long val){

    // requires <smmintrin.h> for the SSE4.1 intrinsics (_mm_mullo_epi32)

    int fact = round( dot / const1 );
    int i, j;

    __m128i vecPi, vecQi, vecCi, vecQCi, vecResi;

    int sseBound = SIZE/4;

    for(i=0, j=0; j<sseBound; i+=4, j++){

        vecPi = _mm_loadu_si128((__m128i *)&((*A)->vector)[i] );
        vecQi = _mm_set_epi32(fact,fact,fact,fact);
        vecCi = _mm_loadu_si128((__m128i *)&((B)->vector)[i] );
        vecQCi = _mm_mullo_epi32(vecQi,vecCi);
        vecResi = _mm_sub_epi32(vecPi,vecQCi);
        _mm_storeu_si128((__m128i *) (((*A)->vector) + i), vecResi );

    }

    //Compute remaining positions if SIZE % 4 != 0
    for(; i<SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];

}

While this works correctly, the performance is exactly the same with and without SSE. I am compiling the code with:

 g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2

Is this because I am not allocating aligned memory (and using the aligned SSE load/store intrinsics accordingly)? The code is very complicated to change, so I have been avoiding that for now.

BTW, in terms of further improvements, and considering that I am bound to the Sandy Bridge architecture, what is the best that I can do?

EDIT: The compiler is not auto-vectorizing the code either. First, I changed the data type of the vectors to short, which didn't change performance. Then I compiled with -fno-tree-vectorize and the performance was the same.

Thanks a lot

a3mlord
  • maybe the compiler already vectorized the code; check the asm code. – concept3d Mar 19 '14 at 15:33
  • I can't, I am compiling multiple .c and .cpp files. Apparently it works fine when you're compiling one file only. I am also sure that it is not vectorizing the code, because if I change the data type to short (instead of int), the performance would necessarily be better, and it isn't. – a3mlord Mar 19 '14 at 15:37
  • BTW, another proof that it is not vectorizing the code is that compiling with -fno-tree-vectorize doesn't change performance... – a3mlord Mar 19 '14 at 15:44
  • I don't know much about how to write vectorized code, but what I do know is that alignment is crucial. The input array needs to be aligned to some larger size, which (I guess) can be expressed using attributes. – leemes Mar 19 '14 at 15:45
  • @a3mlord You can check the disassembly of any file, any function, and even an expression, in any modern IDE. Without the code produced by your compiler there is not much to discuss. Why do you think that shorts "would necessarily be better"? How do you profile performance? – Ivan Aksamentov - Drop Mar 19 '14 at 15:47
  • @leemes: alignment is not too important with recent Intel CPUs (Ivy Bridge, Sandy Bridge, Haswell, et al). It can still make a small difference though. – Paul R Mar 19 '14 at 15:52
  • I'm surprised that g++ will let you divide by `const` at all. – Damon Mar 19 '14 at 17:51
  • You're right, that's not the real name of the variable... It is called const1. – a3mlord Mar 20 '14 at 12:08

3 Answers

3

If your data is large then you may just be memory-bound, since you are doing very few ALU operations per load/store: for each element you transfer 12 bytes (load from A, load from B, store to A) but execute only one multiply and one subtract, so once the data falls out of cache the bus is the bottleneck.

However there are a few minor improvements you can try:

inline void vecSclSSE(Node** A, Node* B, long val){
                                            // make function inline, for cases where SIZE is small

    const int fact = (dot + const1 / 2) / const1;
                                            // use integer arithmetic here if possible
                                            // (rounds to nearest for positive values)

    const __m128i vecQi = _mm_set1_epi32(fact);
                                            // hoist constant initialisation out of loop

    int32_t * const pA = (*A)->vector;      // hoist invariant de-references out of loop
    int32_t * const pB = B->vector;

    __m128i vecPi, vecCi, vecQCi, vecResi;

    int i;                                  // declared before the loop so the scalar
                                            // tail below can pick up where it left off
    for(i = 0; i < SIZE - 3; i += 4) {      // use one loop variable
        vecPi = _mm_loadu_si128((__m128i *)&(pA[i]));
        vecCi = _mm_loadu_si128((__m128i *)&(pB[i]));
        vecQCi = _mm_mullo_epi32(vecQi,vecCi);
        vecResi = _mm_sub_epi32(vecPi,vecQCi);
        _mm_storeu_si128((__m128i *)&(pA[i]), vecResi);
    }

    //Compute remaining positions if SIZE % 4 != 0
    for(; i<SIZE; i++)
        pA[i] -= fact * pB[i];

}
Paul R
  • Hi Paul, thanks for your answer, I wondered when you would jump in. My "SIZE" varies from 20 to 150. BTW, how critical do you think it is to allocate aligned memory? Is it the root cause of the problem, in your opinion? Thanks. – a3mlord Mar 19 '14 at 15:52
  • OK - your data is very small for this kind of optimisation, so you may want to look at reducing overheads. Make the function `inline` and get rid of the `round()` function call - use integer arithmetic for this instead. Use aligned memory if you can, but as you're using Sandy Bridge it's not going to make a huge difference. – Paul R Mar 19 '14 at 15:54
  • @PaulR, I aim at getting performance when SIZE > 100. In fact, I want SIZE to be as high as possible, but it takes days to compute SIZE = 150, and the whole application has exponential complexity. – a3mlord Mar 19 '14 at 16:17
  • @AlexFarber, authors have reported (in peer-reviewed papers at reputable conferences) gains of 2.7x when vectorizing this kernel for shorts (instead of ints). At this point, this is anything but "premature optimization". – a3mlord Mar 19 '14 at 16:18
  • @PaulR, I don't get your transformation regarding the loop variable. Can you clarify that please? – a3mlord Mar 19 '14 at 16:30
  • The single loop variable `i` iterates 4 elements at a time. The test for completion is `i < val - 3` so it will terminate when there are between 0 and 3 elements remaining to be processed (and the scalar loop at the end will handle these, if there are > 0 remaining). – Paul R Mar 19 '14 at 16:43
  • But what's the value of "val"? You meant SIZE, right? – a3mlord Mar 19 '14 at 16:54
  • Yes - my bad - typo on my part - I'll fix it. – Paul R Mar 19 '14 at 16:59
  • Dear AlexFarber, if you look at my last answer (below) you'll find out that it is worthwhile to vectorize small vectors (my preliminary benchmarks include SIZE=50 and 60 only). – a3mlord Mar 20 '14 at 13:07
  • I hope `inline` doesn't matter. If I care about performance, I don't want a compiler that can't inline properly. – R. Martinho Fernandes Mar 20 '14 at 15:54
1

As Paul said, you have relatively few computations per data access, so your code is probably memory-bound. Since unaligned stores/loads are slower than aligned ones, you really should align your data.

You should align to 16 bytes for SSE and (I think) 32 bytes for AVX; note that a cache line on these CPUs is 64 bytes. If you allocate your data yourself, just use aligned_alloc (C11) or posix_memalign (_aligned_malloc on Windows). If you're using std::vector, the easiest way to align is to use a custom allocator instead of std::allocator; this allocator would call one of those functions instead of malloc/new. See also this question.

And then you could switch to aligned instructions for load/stores.
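
For illustration, here is a minimal sketch of both steps, assuming the vectors are plain heap-allocated int32_t arrays (the helper and function names here are made up, not from the question):

#include <cstdint>
#include <cstdlib>      // posix_memalign (Linux; use _aligned_malloc on Windows)
#include <smmintrin.h>  // SSE4.1

// Hypothetical helper: allocate n int32_t elements aligned to 16 bytes.
int32_t* allocVector16(size_t n){
    void* p = NULL;
    if(posix_memalign(&p, 16, n * sizeof(int32_t)) != 0)
        return NULL;
    return (int32_t*)p;
}

// With both pointers 16-byte aligned, the aligned load/store variants apply:
void vecSclAligned(int32_t* pA, const int32_t* pB, int fact, int n){
    const __m128i vecQi = _mm_set1_epi32(fact);
    int i;
    for(i = 0; i < n - 3; i += 4){
        __m128i a = _mm_load_si128((const __m128i*)&pA[i]);   // aligned load
        __m128i b = _mm_load_si128((const __m128i*)&pB[i]);   // aligned load
        _mm_store_si128((__m128i*)&pA[i],
                        _mm_sub_epi32(a, _mm_mullo_epi32(vecQi, b)));  // aligned store
    }
    for(; i < n; i++)   // scalar tail for the remaining 0..3 elements
        pA[i] -= fact * pB[i];
}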

Also, I'm not sure what code is generated for &((*A)->vector)[i]; better to use a local pointer to the data, but be sure to annotate it __restrict.
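
Something along these lines (a sketch; __restrict is the GCC/MSVC spelling, C99 uses restrict):

int32_t * __restrict const pA = (*A)->vector;   // promise the compiler that pA and pB never alias
const int32_t * __restrict const pB = B->vector;

for(int i = 0; i < SIZE; i++)
    pA[i] -= fact * pB[i];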

But before going into all this, be sure it's worth your time and the maintenance burden. You can profile with OProfile on Linux or AMD's CodeAnalyst on Windows.

Antoine
  • I am not using anything that comes from std::. I'll do what you're saying - at this point it sounds inevitable. – a3mlord Mar 19 '14 at 16:13
  • Yeah, sorry, got confused with your `vector`... Be sure to profile though. Added some links in answer. – Antoine Mar 19 '14 at 16:19
-1

I'd like to say that, for the same SIZE, I was able to vectorize a kernel that runs right before the one in the first post. This time I got great speedups (I won't give the factor, because it is meaningless unless I quantify the time spent in this kernel relative to the whole application). The kernel computes the dot product of two vectors, i.e.:

int dot = 0;
for(int i = 0; i < SIZE; i++)
    dot += A->vector[i] * B->vector[i];
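
The SSE4.1 version follows roughly this shape (a sketch of the approach rather than a verbatim copy of my code):

__m128i vSum = _mm_setzero_si128();
int i;
for(i = 0; i < SIZE - 3; i += 4){
    __m128i a = _mm_loadu_si128((const __m128i*)&A->vector[i]);
    __m128i b = _mm_loadu_si128((const __m128i*)&B->vector[i]);
    vSum = _mm_add_epi32(vSum, _mm_mullo_epi32(a, b));   // four partial sums per iteration
}
// horizontal sum of the four 32-bit lanes
vSum = _mm_add_epi32(vSum, _mm_srli_si128(vSum, 8));
vSum = _mm_add_epi32(vSum, _mm_srli_si128(vSum, 4));
int dot = _mm_cvtsi128_si32(vSum);
for(; i < SIZE; i++)                                     // scalar tail
    dot += A->vector[i] * B->vector[i];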

From this I conclude that SIZE being small is not the problem. This suggests, in turn, that I might be doing something wrong in the first kernel. Can somebody suggest a different set of SSE operations for the first kernel? I think it is worthwhile to try. The next step is to allocate aligned memory, but as mentioned before, this isn't crucial on Sandy Bridge and other recent architectures.

This proved once again that the compiler was NOT vectorizing the code.

Thanks

a3mlord