I try to work with SSE and i faced with some strange behaviour.
I write simple code for comparing two strings with SSE Intrinsics, run it and it work. But later i understand, that in my code one of pointer still not aligned, but i use _mm_load_si128
instruction, which requires pointer aligned on a 16-byte boundary.
//Compare two different, not overlapping piece of memory
__attribute((target("avx"))) int is_equal(const void* src_1, const void* src_2, size_t size)
{
//Skip tail for right alignment of pointer [head_1]
const char* head_1 = (const char*)src_1;
const char* head_2 = (const char*)src_2;
size_t tail_n = 0;
while (((uintptr_t)head_1 % 16) != 0 && tail_n < size)
{
if (*head_1 != *head_2)
return 0;
head_1++, head_2++, tail_n++;
}
//Vectorized part: check equality of memory with SSE4.1 instructions
//src1 - aligned, src2 - NOT aligned
const __m128i* src1 = (const __m128i*)head_1;
const __m128i* src2 = (const __m128i*)head_2;
const size_t n = (size - tail_n) / 32;
for (size_t i = 0; i < n; ++i, src1 += 2, src2 += 2)
{
printf("src1 align: %d, src2 align: %d\n", align(src1) % 16, align(src2) % 16);
__m128i mm11 = _mm_load_si128(src1);
__m128i mm12 = _mm_load_si128(src1 + 1);
__m128i mm21 = _mm_load_si128(src2);
__m128i mm22 = _mm_load_si128(src2 + 1);
__m128i mm1 = _mm_xor_si128(mm11, mm21);
__m128i mm2 = _mm_xor_si128(mm12, mm22);
__m128i mm = _mm_or_si128(mm1, mm2);
if (!_mm_testz_si128(mm, mm))
return 0;
}
//Check tail with scalar instructions
const size_t rem = (size - tail_n) % 32;
const char* tail_1 = (const char*)src1;
const char* tail_2 = (const char*)src2;
for (size_t i = 0; i < rem; i++, tail_1++, tail_2++)
{
if (*tail_1 != *tail_2)
return 0;
}
return 1;
}
I print alignment of two pointers and one of this wal aligned but second - wasn't. And program still running correctly and fast.
Then i create synthetic test like this:
//printChars128(...) function just print 16 byte values from __m128i
const __m128i* A = (const __m128i*)buf;
const __m128i* B = (const __m128i*)(buf + rand() % 15 + 1);
for (int i = 0; i < 5; i++, A++, B++)
{
__m128i A1 = _mm_load_si128(A);
__m128i B1 = _mm_load_si128(B);
printChars128(A1);
printChars128(B1);
}
And it crashes, as we expected, on first iteration, when try load pointer B.
Interesting fact that if i switch target
to sse4.2
then my implementation of is_equal
will crash.
Another interesting fact that if i try align second pointer instead of first (so first pointer will be not aligned, second - aligned), then is_equal
will crash.
So, my question is: "Why is_equal
function works fine with only first pointer aligned if i enable avx
instruction generation?"
UPD: This is C++
code. I compile my code with MinGW64/g++, gcc version 4.9.2
under Windows, x86.
Compile string: g++.exe main.cpp -Wall -Wextra -std=c++11 -O2 -Wcast-align -Wcast-qual -o main.exe