4

So I am trying to use the SSE function __mm_load_128, I am very new to SSE fo forgive me if I have made some silly mistakes somewhere.

Here is the code

void one(__m128i *arr, char *temp)
{
    // SSE needs 16 byte alignment.
    _declspec (align(16)) __m128i *tmp = (__m128i*) temp;

    if (((uintptr_t)tmp & 15) == 0)
        printf("Aligned pointer");
    else 
        printf("%d", ((uintptr_t)tmp & 15)); // This prints as 12

    arr[0] = _mm_load_si128(tmp);
}

I get an error on visual studio 0xC0000005: Access violation reading location 0xFFFFFFFF.

0xFFFFFFFF does not look right, what am I doing wrong.

arr argument is initialized as _m128i arr[5] = { 0 }

Alternative would be to use _mm_loadu_128 which works fine but as I understand it, It should produce movdqu instruction but this is the assembly generated

    arr[0] = _mm_loadu_si128(tmp);
00D347F1  mov         eax,dword ptr [tmp]  
00D347F4  movups      xmm0,xmmword ptr [eax]  
00D347F7  movaps      xmmword ptr [ebp-100h],xmm0  
00D347FE  mov         ecx,10h  
00D34803  imul        edx,ecx,0  
00D34806  add         edx,dword ptr [arr]  
00D34809  movups      xmm0,xmmword ptr [ebp-100h]  
00D34810  movups      xmmword ptr [edx],xmm0 

Thanks guys, From the answers I realize I have made couple of mistakes.

  1. Align the source use _alinged_malloc

  2. Compile with optimizations.

  3. Use C++ casts not C

qwn
  • 347
  • 3
  • 15
  • 4
    You are not doing anything to align source address. You are trying to align pointer without changing the stored address. You should ensure that buffer pointed to by `temp` is properly aligned instead. Also you should avoid C-style casts in C++ code. – user7860670 Oct 06 '17 at 05:47
  • 2
    @VTT: `(__m128i*)` is standard style for intrinsics. `reinterpret_cast<__m128i*>(temp)` is generally too bulky, and Intel's intrinsics have long enough names already. (Also, `__m128i` is allowed to alias any other type, so it's special.) – Peter Cordes Oct 06 '17 at 06:08
  • 1
    Don't compile with optimizations disabled if you want sane looking asm output. Emitting `imul edx, ecx, 0` is just a hilariously bad way to zero `edx`. It is doing a `movups` load, but then spilling to a temporary on the stack and reloading (with aligned loads) before storing again to where it actually wants the result. Like I said, enable optimizations if you want the asm to look like what you imagine when writing the C++. – Peter Cordes Oct 06 '17 at 06:12
  • 2
    The sad part here is that it *won't* crash on newer versions of MSVC and ICC because they will often use unaligned accesses anyway. So instead of crashing, you get hard-to-detect performance hits instead. Unfortunately, the MSVC devs said this is by design. – Mysticial Oct 06 '17 at 15:28

1 Answers1

8

I can see three problems here:

  1. this code is strictly C, and not C++. it's not a problem per se, but the question is tagged as C++.
  2. You don't distinguish between the alignment of the pointer vs the alignment of what the pointer points to
  3. when the code flow is inside one, it's impossible to change the alignment of arr or temp.

Let's focus on point number 2 for a second - there's a pointer, and there's what the pointer points to. I guess you already know the difference between these two.

basically , when you write _declspec (align(16)) __m128i *tmp you tell the program:

When you allocate the pointer tmp on the stack, make sure the the first byte of tmp is allocated on an address (on the stack) which is dividable by 16.

So great, tmp itself is aligned to 16, it doesn't affect at all what tmp points to. you need temp to point to align data already. this can be done by

  1. allocate the data on the stack with proper alignas keyword (alignas(16) char my_buffer[16*100];)
  2. allocate dynamic data with memory allocating functions capable of allocating align data, like aligned_alloc, or MSVC's _aligned_malloc which requires _aligned_free. See How to solve the 32-byte-alignment issue for AVX load/store operations?

You cannot align memory retroactively, it has to be allocated aligned in the first place. make sure the data passed by temp is already aligned, or use unaligned loads/stores if you can't require callers to pass aligned data.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
David Haim
  • 25,446
  • 3
  • 44
  • 78
  • 4
    If you compile it as C++, it's C++. You can say it's written in a C-like style, but that doesn't change how the compiler interprets it. There are subtle differences between C and C++. Perhaps this code came from a C example, and the OP is using it in C++. – Peter Cordes Oct 06 '17 at 06:10
  • 2
    you can also write `__asm{mov eax, 8}` and compile it inside a C++ program, but that doesn't make the assembly language C++. – David Haim Oct 06 '17 at 06:11
  • 2
    I mostly agree with that, but still think you should word your point 1 differently. If you want to criticize the use of C style in C++ code, sure. But that doesn't mean it's C. Union type punning would still be undefined behaviour (except that I think MSVC does define it, same as GNU C++.) But you could argue that in some ways it does make the asm part of a MSVC++ program. MSVC inline asm has small differences from MASM (e.g. it accepts `0xdeadbeef` as well as `0deadbeefH`.) – Peter Cordes Oct 06 '17 at 06:21