I'm currently developing a C-module for a Java-application that needs some performance improvements (see Improving performance of network coding-encoding for a background). I've tried to optimize the code using SSE-intrinsics and it executes somewhat faster than the Java-version (~20%). However, it's still not fast enough.
Unfortunately my experience with optimizing C-code is somewhat limited. I therefore would love to get some ideas on how to improve the current implementation.
The inner loop that constitutes the hot-spot looks like this:
for (i = 0; i < numberOfGFVectorsInFragment; i++) {
// Load the 4 GF-elements from the message-fragment and add the log of the coefficeint to them.
__m128i currentMessageFragmentVector = _mm_load_si128 (currentMessageFragmentPtr);
__m128i currentEncodedResult = _mm_load_si128(encodedFragmentResultArray);
__m128i logSumVector = _mm_add_epi32(coefficientLogValueVector, currentMessageFragmentVector);
__m128i updatedResultVector = _mm_xor_si128(currentEncodedResult, valuesToXor);
_mm_store_si128(encodedFragmentResultArray, updatedResultVector);
encodedFragmentResultArray++;
currentMessageFragmentPtr++;
}