
I'm currently developing a C module for a Java application that needs some performance improvements (see Improving performance of network coding-encoding for background). I've tried to optimize the code using SSE intrinsics, and it executes somewhat faster (~20%) than the Java version. However, it's still not fast enough.

Unfortunately my experience with optimizing C code is limited, so I'd be glad to get some ideas on how to improve the current implementation.

The inner loop that constitutes the hot spot looks like this:

    for (i = 0; i < numberOfGFVectorsInFragment; i++) {
        // Load the 4 GF-elements from the message fragment and add the log of the coefficient to them.
        __m128i currentMessageFragmentVector = _mm_load_si128(currentMessageFragmentPtr);
        __m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);

        __m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);

        // Look up the exp-table entry for each log-sum and pack the four results into one vector.
        int logSums[4];
        _mm_storeu_si128((__m128i*)logSums, logSumVector);
        __m128i valuesToXor = _mm_set_epi32(expTable[logSums[3]], expTable[logSums[2]],
                                            expTable[logSums[1]], expTable[logSums[0]]);

        __m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
        _mm_store_si128(encodedFragmentResultArray, updatedResultVector);

        encodedFragmentResultArray++;
        currentMessageFragmentPtr++;
    }
Yrlec
    Have you looked at the assembly the compiler is producing? There will probably be some further gains to be made there. If you can't make the compiler perform them, do it manually and use the (well commented) .s as your source file rather than the .c – OrangeDog Oct 17 '11 at 14:12
  • Does somewhat kill the "portability" aspect of Java though. – OrangeDog Oct 17 '11 at 14:14
  • @OrangeDog I've now added the assembly to the question. – Yrlec Oct 17 '11 at 18:02
  • Did you have any compiler optimisations on at all when you generated that assembly? – OrangeDog Oct 17 '11 at 19:21
  • @OrangeDog yup. I used the following build flags: cl /c /Zi /nologo- /Wall /WX- /Ox /Ob2 /Oi /Ot /Oy- /D WIN32 /D NDEBUG /D _WINDLL /D _UNICODE /D UNICODE /Gm- /EHsc /MD /GS /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Yc"StdAfx.h" /Fp"Release\NetworkCodingAccelerator.pch" /Fo"Release\\" /Fd"Release\vc100.pdb" /Gd /TC /analyze- /errorReport:prompt Stdafx.c Anything you think I should add? I'm a newbie when it comes to MSVC's compiler settings. – Yrlec Oct 17 '11 at 21:55
  • If it's valid, try adding some `restrict` to see if it cuts down on all the MOVs. That, and following Mysticial's tips about loop unrolling. – OrangeDog Oct 18 '11 at 08:56
  • I'd suggest writing it in plain C and compiling with the Intel compiler. That will use sse/avx or whatever, unroll the loop, ... Best done with profile feedback. – Chris Jan 07 '13 at 17:28

2 Answers


Even without looking at the assembly, I can tell right away that the bottleneck is the 4-element gather from memory and the _mm_set_epi32 packing operation. Internally, _mm_set_epi32 will, in your case, probably be implemented as a series of unpacklo/hi instructions.

Most of the "work" in this loop is the packing of these 4 memory accesses. In the absence of SSE4.1, I would go so far as to say that the loop could be faster non-vectorized, but unrolled.
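To make the non-vectorized route concrete, here is a sketch of a scalar, unrolled multiply-and-XOR over a fragment. It is not the original code: the GF(2^8) field, the 0x11d reduction polynomial, and the names `gf_mul_xor_region`/`gf_init` are assumptions for illustration. It uses the doubled exp table (so the log-sum never needs a modulo) plus a precomputed 256-byte product row for the coefficient, which also handles the zero-element case without a branch in the inner loop.

```c
#include <stddef.h>
#include <stdint.h>

/* GF(2^8) log/exp tables; exp is doubled so gf_exp[log a + log b] needs no "% 255". */
static uint8_t gf_exp[512];
static uint8_t gf_log[256];

static void gf_init(void) {
    int x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;
        if (x & 0x100) x ^= 0x11d;      /* reduce by the field polynomial */
    }
    for (int i = 255; i < 510; i++)     /* doubled tail: no modulo in the hot path */
        gf_exp[i] = gf_exp[i - 255];
}

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;
    return gf_exp[gf_log[a] + gf_log[b]];
}

/* dst[i] ^= coef * src[i], unrolled by 4. One precomputed product row for
   `coef` replaces the per-element log/exp lookups inside the loop. */
static void gf_mul_xor_region(uint8_t *dst, const uint8_t *src,
                              uint8_t coef, size_t n) {
    uint8_t row[256];
    for (int v = 0; v < 256; v++) row[v] = gf_mul(coef, (uint8_t)v);

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     ^= row[src[i]];
        dst[i + 1] ^= row[src[i + 1]];
        dst[i + 2] ^= row[src[i + 2]];
        dst[i + 3] ^= row[src[i + 3]];
    }
    for (; i < n; i++) dst[i] ^= row[src[i]];
}
```

The four statements per iteration are independent of each other, so an out-of-order core can overlap the table lookups and XORs.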

If you're willing to use SSE4.1, you can try this. It might be faster, it might not:

    int* logSumArray = (int*)(&logSumVector);

    __m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 2);
    valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 3);

I suggest unrolling the loop at least 4 iterations and interleaving all the instructions to give this code any chance of performing well.
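One way to act on that advice, sketched here unrolled by two for brevity: run two independent insert chains side by side so their latencies overlap instead of serializing on a single `valuesToXor`. The helper name `gather_xor_2x` and the flat `logSums`/`expTable` array layout are assumptions, not the original code.

```c
#include <immintrin.h>
#include <stdint.h>

/* Gather-and-XOR for 8 elements: two interleaved 4-element insert chains,
   `a` and `b`, with no dependency between them. Requires SSE4.1. */
__attribute__((target("sse4.1")))
static void gather_xor_2x(int32_t *result, const int32_t *expTable,
                          const int32_t *logSums) {
    __m128i a = _mm_cvtsi32_si128(expTable[logSums[0]]);
    __m128i b = _mm_cvtsi32_si128(expTable[logSums[4]]);
    a = _mm_insert_epi32(a, expTable[logSums[1]], 1);
    b = _mm_insert_epi32(b, expTable[logSums[5]], 1);
    a = _mm_insert_epi32(a, expTable[logSums[2]], 2);
    b = _mm_insert_epi32(b, expTable[logSums[6]], 2);
    a = _mm_insert_epi32(a, expTable[logSums[3]], 3);
    b = _mm_insert_epi32(b, expTable[logSums[7]], 3);

    /* XOR both gathered vectors into the result region. */
    __m128i r0 = _mm_loadu_si128((const __m128i *)result);
    __m128i r1 = _mm_loadu_si128((const __m128i *)(result + 4));
    _mm_storeu_si128((__m128i *)result, _mm_xor_si128(r0, a));
    _mm_storeu_si128((__m128i *)(result + 4), _mm_xor_si128(r1, b));
}
```

Because the `a` and `b` chains share no registers, the high-latency `pinsrd` instructions of one chain can execute in the shadow of the other's.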

What you really need is Intel's AVX2 gather/scatter instructions. But that's a few years down the road...
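For reference, on AVX2 hardware the entire gather-and-pack step collapses into a single instruction. A hypothetical sketch (the `gather8` name and layout are assumptions; this needs an AVX2-capable CPU at run time):

```c
#include <immintrin.h>
#include <stdint.h>

/* Fetch 8 table entries in one vpgatherdd instead of 8 scalar loads
   plus a packing sequence. */
__attribute__((target("avx2")))
static void gather8(int32_t *out, const int32_t *table, const int32_t *idx) {
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    __m256i vals = _mm256_i32gather_epi32(table, vidx, 4);  /* scale = sizeof(int32_t) */
    _mm256_storeu_si256((__m256i *)out, vals);
}
```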

Mysticial
  • Thanks, but your code yields the following error: Error C2275: '__m128i' : illegal use of this type as an expression. The error seems to be related to overwriting valuesToXor, because if I remove the last three lines it compiles (the error message always points to the expression after the last line that sets valuesToXor). – Yrlec Oct 17 '11 at 16:57
  • Did you include the SSE4.1 header `<smmintrin.h>`? – Mysticial Oct 17 '11 at 17:32
  • @Mysticial I've even seen the `_mm_set_` functions implemented by MSVC as 4 copies into temporary aligned memory and then a load from memory into the register, and I could only shake my head when seeing this. – Christian Rau Oct 17 '11 at 17:54
  • @Mysticial Yes. For instance if I write this: `__m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]); __m128i valuesToXor2 = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);` it compiles, but if I replace it with: `__m128i valuesToXor = _mm_cvtsi32_si128(expTable[*(logSumArray++)]); valuesToXor = _mm_insert_epi32(valuesToXor, expTable[*(logSumArray++)], 1);` the compiler reports the error on the line after it. – Yrlec Oct 17 '11 at 17:56
  • @Christian Rau: I've seen that before too... it's hilarious... XD – Mysticial Oct 17 '11 at 17:56
  • Hmm... that's strange... I suppose renaming the variable should still work as a work-around since the compiler will merge them anyway. But I am wondering why it isn't working... – Mysticial Oct 17 '11 at 18:07
  • I just tested this and it compiles on VS2010: `__m128i valuesToXor = _mm_cvtsi32_si128(1); valuesToXor = _mm_insert_epi32(valuesToXor,10, 1); valuesToXor = _mm_insert_epi32(valuesToXor,20, 2); valuesToXor = _mm_insert_epi32(valuesToXor,30, 3);` So I'm also at a loss to what's going on... What compiler are you using? – Mysticial Oct 17 '11 at 18:59
  • @Mysticial I've solved it now. I had messed up some code above which led to this. Didn't improve the performance though. :( Thanks anyway! – Yrlec Oct 17 '11 at 21:56
  • Did you try unrolling the loop? The SSE4.1 example I gave has a nasty dependency on `valuesToXor`. Furthermore, `_mm_insert_epi32` has a high latency. So you'll need to unroll to get the benefit (if it exists). – Mysticial Oct 17 '11 at 21:59
  • I've tried unrolling and it provided some performance improvement (~5%-10%) but I'll have to continue trying to find the bottlenecks. – Yrlec Oct 18 '11 at 10:54

Maybe try http://web.eecs.utk.edu/~plank/plank/papers/CS-07-593/. The functions with "region" in their names are supposedly fast. They don't seem to use any kind of special instruction sets, but maybe they've been optimized in other ways...

David
  • Thanks! They seem to employ many similar tricks as I did (like the double log-tables to avoid the extra modulo). However I'll check the region-code to see if they have some other optimizations as well. – Yrlec Oct 17 '11 at 21:57
  • After reading the docs I actually think that their code is slower. They achieve ~167 MB/s when multiplying and XOR-ing into a region. I achieve about 1GB/s for the same operation (I probably have a faster CPU but the difference is still quite big). The problem is that I'm doing that >1000/s so the actual output throughput is only ~1Mb/s. – Yrlec Oct 18 '11 at 07:28