SSE2, Visual Studio 2010, and Debug Build

Question

Can the compiler make automatic use of SSE2 while optimisations are disabled?

When optimisations are disabled, does the /arch:SSE2 flag mean anything?

I've been given the task of squeezing more performance out of our software. Unfortunately, release builds are done using the debug settings, and attempts to argue for the case of optimisation have been unsuccessful so far.

I'm compiling for x86 with the compiler flags /ZI /Od /arch:SSE2 /FAs. The generated assembly shows that the compiler is not making use of SSE2. Is this because optimisation is disabled?

In the code, there are a few situations similar to this:

char* begin = reinterpret_cast<char*>(&bufferObject);
char* end   = begin + sizeof(bufferObject);
char  result = 0;  // start from a defined value before XOR-ing
while ( begin != end ) {
    result ^= *begin++;
}

I'd like to have the compiler vectorise this operation for me, but it doesn't; I suspect optimisation needs to be enabled.

I hand-coded two solutions: one using an inline __asm block, and the other using the SSE2 intrinsics defined in <emmintrin.h>. I'd prefer not to rely on either.

Update

Further to the questions above, I would like calls to library functions, like memcpy, to use the provided vectorised versions when appropriate. Looking at the assembly code for memcpy, I can see that there is a function called _VEC_memcpy which makes use of SSE2 for faster copying. The block which decides whether to branch to this routine or not is this:

    ; First, see if we can use a "fast" copy SSE2 routine
    ; block size greater than min threshold?
    cmp     ecx,080h
    jb      Dword_align
    ; SSE2 supported?
    cmp     DWORD PTR __sse2_available,0
    je      Dword_align
    ; alignments equal?
    push    edi
    push    esi
    and     edi,15
    and     esi,15
    cmp     edi,esi
    pop     esi
    pop     edi
    jne     Dword_align

    ; do fast SSE2 copy, params already set
    jmp     _VEC_memcpy
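Read as C++, the three tests in that block amount to roughly the predicate below. This is only a sketch: `takes_fast_copy_path` is a made-up name, `sse2_available` stands in for the CRT's `__sse2_available` variable, and it assumes that ecx holds the byte count while edi and esi hold the destination and source pointers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper mirroring the checks in the listing; not CRT code.
bool takes_fast_copy_path(const void* dst, const void* src,
                          std::size_t count, bool sse2_available)
{
    if (count < 0x80) {
        return false;                  // cmp ecx,080h / jb Dword_align
    }
    if (!sse2_available) {
        return false;                  // cmp __sse2_available,0 / je Dword_align
    }
    // Both pointers must sit at the same offset within a 16-byte block.
    const std::uintptr_t d = reinterpret_cast<std::uintptr_t>(dst) & 15;  // and edi,15
    const std::uintptr_t s = reinterpret_cast<std::uintptr_t>(src) & 15;  // and esi,15
    return d == s;                     // jne Dword_align when they differ
}
```

So a copy shorter than 128 bytes, or one whose source and destination are misaligned relative to each other, would never reach the SSE2 routine regardless of compiler flags.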

I don't think that _VEC_memcpy is being called... ever.

Should the `/arch:SSE2` flag be defining this `__sse2_available` symbol?

Have you tried running an optimized build and measuring the difference in performance? It could help make your argument stronger. — stonemetal, Jul 11 '12 at 01:13
Hmm... interesting. I've never looked at VC++'s `memcpy()` before. But that definitely could be a special case that it might actually vectorize. (along with `memset()`) — Mysticial, Jul 11 '12 at 01:20
`__sse2_available` looks like a global variable that's set by some CPU dispatcher (probably at program launch). It's run-time determined. So it's probably trying to use SSE2 even without `/arch:SSE2`. — Mysticial, Jul 11 '12 at 01:22
@stonemetal I'm currently battling with that. Some modules want to link with the CrtDebug library, so linkage fails. Is there a way to include this library in a release build? — Anthony, Jul 11 '12 at 01:44
@anthony-arnold The optimization settings and the runtime library you link against are independent. The only changes you have to make to a Debug build are to switch to a regular debug database (no Edit and Continue) and to turn off the basic runtime checks (that is, if you are using the IDE's default debug settings). Incremental linking causes a warning but works anyway. — stonemetal, Jul 11 '12 at 03:09
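To illustrate the run-time check described in the comments: a flag like `__sse2_available` could be filled in from CPUID at start-up. The sketch below is an assumption about how such a dispatcher might test for SSE2, not the CRT's actual code. `cpu_has_sse2` is a made-up name, and the code assumes an x86 or x64 target.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Returns true when CPUID leaf 1 reports SSE2 (bit 26 of EDX).
bool cpu_has_sse2()
{
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};       // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[3] & (1 << 26)) != 0;
#else
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return false;                 // leaf 1 not supported
    }
    return (edx & (1u << 26)) != 0;
#endif
}
```

A check like this depends only on the CPU the program runs on, which would explain why the result has nothing to do with `/arch:SSE2`.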

score 9 · Accepted Answer · answered Jul 11 '12 at 00:57

Visual Studio 2010 and earlier have no support for automatic vectorization at all.

The purpose of /arch:SSE2 is to allow the compiler to use scalar SSE for floating-point operations instead of the x87 FPU.

So you may get some speedup with /arch:SSE2 since it allows you to access more registers on x64. But keep in mind that it is not from vectorization.
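If you want to confirm at compile time which level the compiler was told to target, the predefined macros can be inspected. A small sketch: `compiled_sse_level` is a made-up name, `_M_IX86_FP` is the MSVC macro for 32-bit x86 builds (0 without an SSE level, 1 for /arch:SSE, 2 for /arch:SSE2), and `__SSE__`/`__SSE2__` are the GCC/Clang equivalents.

```cpp
// Reports the SSE level the compiler is generating code for:
// 0 = none requested, 1 = SSE, 2 = SSE2 or better.
int compiled_sse_level()
{
#if defined(_M_IX86_FP)
    return _M_IX86_FP;     // MSVC, 32-bit x86: set by /arch
#elif defined(_M_X64) || defined(__SSE2__)
    return 2;              // x64 always has SSE2 as a baseline
#elif defined(__SSE__)
    return 1;
#else
    return 0;
#endif
}
```

This only tells you what the compiler is permitted to emit. It says nothing about whether any particular loop was vectorized.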

If you want vectorization on VS2010, you pretty much have to do it manually with intrinsics.
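For a byte-wise XOR like the loop in the question, the manual route could look something like the sketch below. It is a minimal illustration, not the asker's hand-coded version. `xor_bytes_sse2` is a made-up name, and the code assumes an x86 target with SSE2 enabled.

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// XOR of all bytes in [data, data + size).
char xor_bytes_sse2(const char* data, std::size_t size)
{
    __m128i acc = _mm_setzero_si128();
    std::size_t i = 0;

    // Main loop: XOR 16 bytes at a time into 16 independent lanes.
    for (; i + 16 <= size; i += 16) {
        // Unaligned load, so the buffer needs no special alignment.
        const __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_xor_si128(acc, chunk);
    }

    // Fold the 16 lanes down to a single byte.
    unsigned char lanes[16];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    unsigned char result = 0;
    for (int k = 0; k < 16; ++k) {
        result ^= lanes[k];
    }

    // Handle the remaining 0..15 bytes one at a time.
    for (; i < size; ++i) {
        result ^= static_cast<unsigned char>(data[i]);
    }
    return static_cast<char>(result);
}
```

Because XOR is associative and commutative, the order in which bytes are combined does not matter, which is what makes the lane-wise accumulation valid. It would be called as `xor_bytes_sse2(reinterpret_cast<const char*>(&bufferObject), sizeof(bufferObject))`.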

Visual Studio 2012 has support for auto-vectorization:

http://msdn.microsoft.com/en-us/library/hh872235%28v=vs.110%29.aspx

score 4 · Answer 2 · answered Jul 11 '12 at 01:07

Trying to optimize code built with MSVC's debug settings is kind of a fool's errand, since the compiler is effectively going out of its way to make your code slow by, e.g., juggling data onto and off the stack (which induces load-hit-store stalls) and other such things.

In any case, MSVC doesn't vectorize that block whether in Release or Debug. You'll need to use intrinsics to get it to emit the right machine code. This is the output with /O2 /Ot /Oi /arch:SSE2:

PUBLIC  ?VectorTest@@YADPAD0@Z              ; VectorTest
; Function compile flags: /Ogtp
;   COMDAT ?VectorTest@@YADPAD0@Z
_TEXT   SEGMENT
_begin$ = 8                     ; size = 4
_result$ = 11                       ; size = 1
_end$ = 12                      ; size = 4
?VectorTest@@YADPAD0@Z PROC             ; VectorTest, COMDAT

; 143  : {

    push    ebp
    mov ebp, esp

; 144  :    char  result;
; 145  :    while ( begin != end ) {

    mov ecx, DWORD PTR _begin$[ebp]
    mov edx, DWORD PTR _end$[ebp]
    mov al, BYTE PTR _result$[ebp]
    cmp ecx, edx
    je  SHORT $LN1@VectorTest
$LL2@VectorTest:

; 146  :        result ^= *begin++;

    xor al, BYTE PTR [ecx]
    inc ecx
    cmp ecx, edx
    jne SHORT $LL2@VectorTest
$LN1@VectorTest:

; 147  :    }
; 148  :    return result;
; 149  : }

    pop ebp
    ret 0
?VectorTest@@YADPAD0@Z ENDP             ; VectorTest
_TEXT   ENDS

Contemporary compilers are really lousy at vectorization, so we rely on using SSE intrinsics throughout our app. I doubt any compiler would vectorize that particular operation as it is essentially a "reduce" rather than a "map", and I've yet to see a compiler that does horizontal (non-orthogonal) vectorization.

Reduce operations are actually common enough for compilers to make a special case for them. I've seen both GCC and ICC vectorize reductions. (not that I'd rely on it though) — Mysticial, Jul 11 '12 at 01:10
@Mysticial I've never seen GCC do it for any of our code, but I imagine it's possible. — Crashworks, Jul 11 '12 at 01:10
Took me a while to find it. But here's one example from a while back where GCC was able to vectorize a reduction: http://stackoverflow.com/questions/7451342/how-could-this-java-code-be-sped-up — Mysticial, Jul 11 '12 at 01:14
