Successful compilation of SSE instruction with qmake (but SSE2 is not recognized)

Question

I'm trying to compile and run code that I migrated from Unix to Windows. The code is pure C++ and does not use any Qt classes. It works fine on Unix.

I'm using Qt Creator as the IDE and qmake.exe with -spec win32-g++ to build. Since my code contains SSE instructions, I have to include the emmintrin.h header.

I added:

QMAKE_FLAGS_RELEASE += -O3 -msse4.1 -mssse3 -msse3 -msse2 -msse

QMAKE_CXXFLAGS_RELEASE += -O3 -msse4.1 -mssse3 -msse3 -msse2 -msse

to the .pro file. The code compiles without errors, but at run time it fails in functions that use __m128 or similar types.

When I open emmintrin.h, I see:

#ifndef __SSE2__
# error "SSE2 instruction set not enabled"
#else

and the part after #else appears to be undefined.

I don't know how to enable SSE on my computer.

Platform: Windows Vista

System type: 64-bit

Processor: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz

Does anyone know the solution?

Thanks in advance.

Can you be more specific about the run-time errors ? Please copy and paste the actual error message(s). — Paul R, Sep 06 '13 at 08:51
The run-time error itself is not the issue, although it is an access violation. The main problem is that `__SSE__` is not defined. I'm sure about the rest of the code. — Hamid Bazargani, Sep 06 '13 at 09:18
What makes you think that `__SSE__` is not defined ? You seem to be confused by what you've seen in a header, which is not relevant here. — Paul R, Sep 06 '13 at 12:33
Yes, I think I got confused, because I see gray lines in Qt Creator, which means those lines will not be compiled. Moreover, the IDE says `__m128` is not a type name. But when I step through the code in debug mode, the SSE part runs fine. I think the problem is somewhere else. It crashes while executing the `_mm_load_ps()` function, maybe due to a badly allocated input. BTW, I'm wondering whether that function is supported on my machine at all. — Hamid Bazargani, Sep 06 '13 at 12:40
This is why I asked you to post the actual run-time error message - it sounds like a typical SSE misaligned data problem. Try aligning your data correctly or use `_mm_loadu_ps` instead of `_mm_load_ps`. — Paul R, Sep 06 '13 at 12:42
Thanks. Changing to `_mm_loadu_ps` solved it. But how can I align my data? My data is a struct of two float variables, and I defined `MYSTRUCT *data`. The function input is `_mm_loadu_ps((float*) data)`. Anyway, your answer solved my question. — Hamid Bazargani, Sep 06 '13 at 13:50
On Windows/Visual Studio you can use the `__declspec(align(16))` attribute for static allocations or `_aligned_malloc` for dynamic allocations. For gcc and most other civilised platforms/compilers use `__attribute__ ((aligned(16)))` for the former and `posix_memalign` for the latter. — Paul R, Sep 06 '13 at 14:23
P.S. since this seems to have solved your problem I've converted the above comments into an answer now (see below). — Paul R, Sep 06 '13 at 14:27

score 3 · Accepted Answer · answered Sep 06 '13 at 14:26

It sounds like your data is not 16-byte aligned, which is a requirement for aligned SSE loads such as _mm_load_ps. You can either:

use _mm_loadu_ps as a temporary workaround. On newer CPUs the performance hit for misaligned loads such as this is fairly small (on older CPUs it's much more significant), but it should still be avoided if possible

or

fix your memory alignment. On Windows/Visual Studio you can use the __declspec(align(16)) attribute for static allocations or _aligned_malloc for dynamic allocations. For gcc and most other civilised platforms/compilers use __attribute__ ((aligned(16))) for the former and posix_memalign for the latter.
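As a rough illustration of the second option, here is a minimal sketch using C++11 `alignas(16)` for static or automatic storage and `_mm_malloc`/`_mm_free` for dynamic storage, which GCC/MinGW and MSVC both provide. The `Pair` struct and the helper names (`is_aligned16`, `sum_two_pairs`, `demo_static`, `demo_dynamic`) are hypothetical stand-ins, not the asker's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_loadu_ps, _mm_malloc, _mm_free

// Hypothetical stand-in for the asker's struct of two floats.
struct Pair {
    float a;
    float b;
};

// True if p sits on a 16-byte boundary, which _mm_load_ps requires.
inline bool is_aligned16(const void* p) {
    return (reinterpret_cast<std::uintptr_t>(p) & 15u) == 0;
}

// Sum the four floats stored in two consecutive Pair objects.
// Uses the aligned load when the address allows it, otherwise the unaligned one.
inline float sum_two_pairs(const Pair* data) {
    const float* f = reinterpret_cast<const float*>(data);
    __m128 v = is_aligned16(f) ? _mm_load_ps(f) : _mm_loadu_ps(f);
    alignas(16) float out[4];
    _mm_store_ps(out, v);  // out is 16-byte aligned, so the aligned store is safe
    return out[0] + out[1] + out[2] + out[3];
}

// Static/automatic storage: alignas(16) puts the array on a 16-byte boundary.
inline float demo_static() {
    alignas(16) Pair pairs[2] = {{1.0f, 2.0f}, {3.0f, 4.0f}};
    assert(is_aligned16(pairs));
    return sum_two_pairs(pairs);
}

// Dynamic storage: _mm_malloc returns memory with the requested alignment
// and must be released with _mm_free (not free or delete).
inline float demo_dynamic() {
    Pair* pairs = static_cast<Pair*>(_mm_malloc(2 * sizeof(Pair), 16));
    if (!pairs) return 0.0f;
    assert(is_aligned16(pairs));
    pairs[0] = {1.0f, 2.0f};
    pairs[1] = {3.0f, 4.0f};
    float s = sum_two_pairs(pairs);
    _mm_free(pairs);
    return s;
}
```

Note that one `_mm_load_ps` reads four floats, so with a two-float struct each load covers two consecutive elements; the array as a whole needs the 16-byte alignment, not each element.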

Paul R

I don't agree that the performance hit for misaligned loads is fairly small. It was not small the last time I checked on my i7-2600K (Sandy Bridge): my GEMM code drops significantly in efficiency with misaligned data. What is small is the performance difference between `loadu` and `load` when the data is aligned, which used to not be the case (i.e. `load` is not really necessary anymore). However, `loadu` on misaligned data is still much slower. – Z boson Sep 06 '13 at 20:19
I wrote a Windows program to measure the _mm_loadu_ps penalty. The program loads floats from an array using _mm_loadu_ps or _mm_load_ps and displays execution time for each. The benchmark loop also contains an add instruction to keep optimizers from removing the load. Building with mingw64/gcc 4.8 or Visual Studio give the same result for SB (@4.0GHz): 0.75 ns per loop for _mm_load_ps and 0.78 ns per loop for _mm_loadu_ps. So if I coded the test correctly, Sandy Bridge does an impressive job of reducing the unaligned load overhead. The source code is here: http://notabs.org/misc/alignment.7z. – Sep 07 '13 at 04:03
@ScottD, Thanks for the code! I looked briefly at it. It looks quite good. One thing I noticed is that you're only testing the L1 region. I'll have to look into this again. – Z boson Sep 07 '13 at 05:44
@ScottD, so I ran your code on my i5-3317U CPU @ 1.70GHz under Linux (after some modifications) and I get 1.22 ns for unaligned and 1.17 ns for aligned. That's about 4% slower. Based on that, Paul's statement "the performance hit for misaligned loads such as this is fairly small (on older CPUs it's much more significant), but it should still be avoided if possible" is accurate. I'll have to check this more carefully. – Z boson Sep 07 '13 at 09:31
@redrum: there are cases where misaligned loads can have an *indirect* effect on performance, mainly where L1 cache footprint is critical. Misaligned loads where the vector is split across cache lines can cause additional cache lines to be loaded, so cache usage may become less efficient as more cache lines than strictly necessary are evicted. Similarly for TLB entries, DRAM pages, etc, but these are probably less likely to be a factor for aligned versus misaligned loads. – Paul R Sep 07 '13 at 10:33
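To make the cache-line point concrete, here is a small sketch that counts how many 16-byte loads in a sequential walk straddle a cache-line boundary. It assumes a 64-byte cache line, which is an assumption about the hardware rather than something stated in the thread, and all names in it are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes
constexpr std::size_t kVecBytes  = 16;  // size of one SSE vector (__m128)

// True if a 16-byte load starting at addr touches two cache lines.
inline bool crosses_cache_line(std::uintptr_t addr) {
    return (addr % kCacheLine) + kVecBytes > kCacheLine;
}

// Count the split loads when reading n_loads consecutive vectors from base.
inline std::size_t count_split_loads(std::uintptr_t base, std::size_t n_loads) {
    std::size_t splits = 0;
    for (std::size_t i = 0; i < n_loads; ++i) {
        if (crosses_cache_line(base + i * kVecBytes)) ++splits;
    }
    return splits;
}
```

With a 16-byte-aligned base no load is split, while a base that is off by one float (4 bytes) splits one load in every four under this assumption.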
@redrum, a pain to port that code I know. I tried some bigger array sizes and the results were the same until the total buffer size exceeded the 8MB L3 cache size. After that, _mm_loadu_ps and _mm_load_ps slowed equally. An AMD processor A6-3650 also handles unaligned well. But even so, aligning data is beneficial and a small price to pay for increased portability and friendliness with older processors. As Paul R explained, valuable L1 cache space is wasted when the array is not aligned, and that could impact performance for certain loops. – Sep 07 '13 at 15:43
@PaulR, Thank you for the information. Maybe that's what I saw in the past (when I first started writing my GEMM code). I seem to recall much worse performance with misaligned loads, and since then I have not bothered to test misaligned memory. If I manage to reproduce/recover the tests I did, I'll post an SO question about them. – Z boson Sep 07 '13 at 18:43
@ScottD, it's fairly easy to make your code cross platform/compiler. [Use `_mm_malloc` instead of `_aligned_malloc`](http://stackoverflow.com/questions/16376942/best-cross-platform-method-to-get-aligned-memory) (which was my question in a previous SO life) as well as `omp_get_wtime()` (instead of the windows performance timer) and your code will work on GCC, MinGW, MSVC, and ICC with Linux and Windows (and probably OSX). The only part I did not implement was changing the task priority. I don't have a good solution for that so I commented it out. – Z boson Sep 07 '13 at 18:48
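For readers who want to try a measurement along these lines, here is a rough portable sketch. It is not the benchmark linked above: it allocates with `_mm_malloc` as suggested, but times with `std::chrono::steady_clock` instead of `omp_get_wtime()` so that no OpenMP flag is needed. All function and type names are hypothetical.

```cpp
#include <chrono>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics, _mm_malloc, _mm_free

// Sum n_vecs vectors starting at p with aligned loads; p must be 16-byte aligned.
inline float sum_load(const float* p, std::size_t n_vecs) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n_vecs; ++i)
        acc = _mm_add_ps(acc, _mm_load_ps(p + 4 * i));
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}

// Same sum with unaligned loads; p may have any float alignment.
inline float sum_loadu(const float* p, std::size_t n_vecs) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n_vecs; ++i)
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + 4 * i));
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}

struct Timing {
    float sum;           // accumulated result, returned so the loads are not discarded
    double ns_per_load;  // average wall-clock time per vector load
};

// Call f(p, n_vecs) `repeats` times and report the average time per load.
template <typename F>
Timing time_loads(F f, const float* p, std::size_t n_vecs, int repeats) {
    using clock = std::chrono::steady_clock;
    float sink = 0.0f;
    const auto t0 = clock::now();
    for (int r = 0; r < repeats; ++r) sink += f(p, n_vecs);
    const auto t1 = clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return {sink, ns / (static_cast<double>(repeats) * static_cast<double>(n_vecs))};
}

// Allocate n_floats floats on a 16-byte boundary, filled with 1.0f.
// The caller releases the buffer with _mm_free.
inline float* make_buffer(std::size_t n_floats) {
    float* buf = static_cast<float*>(_mm_malloc(n_floats * sizeof(float), 16));
    if (buf) {
        for (std::size_t i = 0; i < n_floats; ++i) buf[i] = 1.0f;
    }
    return buf;
}
```

One would compare `time_loads(sum_load, buf, n, reps)` against `time_loads(sum_loadu, buf + 1, n, reps)`, with the buffer holding at least `4 * n + 1` floats. An optimizing compiler may still hoist work out of the repeat loop, so the numbers are indicative only; a serious benchmark needs more care.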
