20

I'm trying to improve the performance of a copy operation via SSE and AVX:

    #include <immintrin.h>
    #include <algorithm>   // std::generate, std::copy
    #include <chrono>      // timing
    #include <iostream>    // std::cout

    const int sz = 1024;
    float *mas = (float *)_mm_malloc(sz*sizeof(float), 16);
    float *tar = (float *)_mm_malloc(sz*sizeof(float), 16);
    float a=0;
    std::generate(mas, mas+sz, [&](){return ++a;});
    
    const int nn = 1000; // Number of iterations in the tester loops
    std::chrono::time_point<std::chrono::system_clock> start1, end1, start2, end2, start3, end3; 
    
    //std::copy testing
    start1 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
        std::copy(mas, mas+sz, tar);
    end1 = std::chrono::system_clock::now();
    float elapsed1 = std::chrono::duration_cast<std::chrono::microseconds>(end1-start1).count();
    
    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=4, _tar+=4)
        {
           __m128 buffer = _mm_load_ps(_mas);
           _mm_store_ps(_tar, buffer);
        }
    }
    end2 = std::chrono::system_clock::now();
    float elapsed2 = std::chrono::duration_cast<std::chrono::microseconds>(end2-start2).count();
     
    //AVX-copy testing
    start3 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
           __m256 buffer = _mm256_load_ps(_mas);
           _mm256_store_ps(_tar, buffer);
        }
    }
    end3 = std::chrono::system_clock::now();
    float elapsed3 = std::chrono::duration_cast<std::chrono::microseconds>(end3-start3).count();
    
    std::cout<<"serial - "<<elapsed1<<", SSE - "<<elapsed2<<", AVX - "<<elapsed3<<"\nSSE gain: "<<elapsed1/elapsed2<<"\nAVX gain: "<<elapsed1/elapsed3;
    
    _mm_free(mas);
    _mm_free(tar);

It works. However, as the number of iterations in the tester loops, nn, increases, the performance gain of the SIMD copies decreases:

nn=10: SSE-gain=3, AVX-gain=6;

nn=100: SSE-gain=0.75, AVX-gain=1.5;

nn=1000: SSE-gain=0.55, AVX-gain=1.1;

Can anybody explain the reason for this performance decrease, and is it advisable to vectorize the copy operation manually?

gorill
  • I believe I read somewhere (Agner Fog?) that due to the aggressive power management on Haswell there can be a "ramp up" time (several hundred cycles?) when you start using a previously idle execution unit such as SSE/AVX. For small nn this may be distorting your measurements. You should look at the absolute times (per element) as well as the ratios to verify this. – Paul R Aug 19 '13 at 13:24
  • @PaulR But here SSE/AVX are getting slower, not faster... It's a ramp-down, not a ramp-up. – xanatos Aug 19 '13 at 13:42
  • @xanatos: yes, but perhaps `std::copy` is already using SSE/AVX, and the ramp-up is impacting mainly `std::copy` and not the subsequent hand-coded SIMD copies. You could test this by changing the order of the copies, I suppose. – Paul R Aug 19 '13 at 13:44
  • You'd need to check the source of `std::copy` first. It's possible that it already has some funky vectorization improvements implemented. Also, issues like these are rather hard to track down without the assembly code produced by the compiler. – Daniel Kamil Kozar Aug 19 '13 at 13:49
  • FWIW, I'm unable to reproduce this on VS2012 with an Intel Core i7 2600K. Using `nn = 1000` is too small to measure. Going up to `nn = 1000000` shows `SSE gain: 1.02222` and `AVX gain: 1.70371` - which is what I'd expect to see if the compiler is only using SSE by itself. – Mysticial Aug 19 '13 at 13:57
  • @Mysticial, I ran on the same processor (Core i7 2600K), Ubuntu 12.10, GCC 4.6 with flags (-O2 -msse -msse2 -msse4.2 -mavx -mfpmath=sse): nn = 1000000 shows SSE gain: 0.55 and AVX gain: 1.13. Very strange. – gorill Aug 19 '13 at 14:04
  • Your code contains a bug: AVX aligned copies require 32-byte alignment, but you only request 16-byte alignment. Additionally, I think the size of your test case is severely flawed. On Windows you're lucky if the system clock implements 1 ms precision, but the results of your test case run in the microsecond range on my system (i7-2820QM). If I add a couple of zeroes here and there, the results are quite close (~5%). Don't forget to warm up your processor... – Stefan Jun 04 '14 at 22:25
  • @Paul: MSVC11's `std::copy` calls memcpy(), which does not use SSE or AVX. It uses movnti qword instructions to move 8 bytes at a time and avoids cache pollution. There is also various prefetch magic in the assembly. – Stefan Jun 05 '14 at 06:02
  • @Stefan: thanks - that's useful information, but I believe the OP mentioned that he is running Linux. – Paul R Jun 05 '14 at 06:11
  • Related: https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy, the canonical Q&A about x86 memory bandwidth. – Peter Cordes Oct 19 '17 at 21:12

5 Answers

25

The problem is that your test does a poor job of mitigating some factors in the hardware that make benchmarking hard. To test this, I've made my own test case, something like this:

for blah blah:
    sleep(500ms)
    std::copy
    sse
    avx

output:

SSE: 1.11753x faster than std::copy
AVX: 1.81342x faster than std::copy

So in this case, AVX is a bunch faster than std::copy. What happens when I change the test case to...

for blah blah:
    sleep(500ms)
    sse
    avx
    std::copy

Notice that absolutely nothing changed, except the order of the tests.

SSE: 0.797673x faster than std::copy
AVX: 0.809399x faster than std::copy

Whoa! How is that possible? The CPU takes a while to ramp up to full speed, so tests that run later have an advantage. This question has 3 answers now, including an 'accepted' answer, but only the one with the fewest upvotes was on the right track.

This is one of the reasons why benchmarking is hard and you should never trust anyone's micro-benchmarks unless they've included detailed information about their setup. It isn't just the code that can go wrong. Power-saving features and weird drivers can completely mess up your benchmark. I once measured a factor-of-7 difference in performance by toggling a switch in the BIOS that fewer than 1% of notebooks offer.
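
To make such a comparison more robust, each variant can be warmed up and timed separately, taking the minimum over many runs. A minimal sketch of that idea (the helper `time_best_of` is my own illustration, not code from the question):

    #include <algorithm>
    #include <chrono>
    #include <cstdio>

    template <typename F>
    double time_best_of(F&& f, int warmup = 100, int runs = 1000)
    {
        // Warm-up: let the clock frequency ramp up and the caches fill
        // before anything is measured.
        for (int i = 0; i < warmup; ++i)
            f();

        double best = 1e300;
        for (int i = 0; i < runs; ++i)
        {
            auto t0 = std::chrono::steady_clock::now();
            f();
            auto t1 = std::chrono::steady_clock::now();
            best = std::min(best,
                std::chrono::duration<double, std::micro>(t1 - t0).count());
        }
        // The minimum over many runs filters out scheduler and
        // frequency-scaling noise better than one long averaged run.
        return best;
    }

    int main()
    {
        static float src[1024], dst[1024];
        std::printf("std::copy: %f us\n",
                    time_best_of([&] { std::copy(src, src + 1024, dst); }));
    }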

Stefan
  • This answer makes some extremely important points, without which the whole discussion would be useless. But I'm afraid it is not entirely correct either. It states "The CPU takes a while to ramp up to full speed"; however, the problem here seems more likely related to caching. A good test must (at a minimum) be run multiple times in a loop to mitigate this, NEVER just once. – mafu May 05 '17 at 16:37
  • So about that "detailed test setup", what OS and CPU did *you* test this on? It's before August 2015, so we know it's not Skylake (which introduced hardware P-states for much quicker ramp-up to full clock speed). But we don't know if you're on AMD Bulldozer or Intel SnB or Haswell or what. – Peter Cordes Oct 19 '17 at 20:01
  • @PeterCordes I used an i7-2820QM (mobile) sandy bridge processor and some desktop flavor of windows (probably windows 8, not sure). – Stefan Oct 20 '17 at 16:07
9

This is a very interesting question, but I believe none of the answers so far is correct, because the question itself is misleading.

The title should be changed to "How does one reach the theoretical memory I/O bandwidth?"

No matter what instruction set is used, the CPU is so much faster than RAM that a pure block memory copy is 100% I/O bound. This explains why there is little difference between SSE and AVX performance.

For small buffers hot in L1D cache, AVX can copy significantly faster than SSE on CPUs like Haswell where 256b loads/stores really do use a 256b data path to L1D cache instead of splitting into two 128b operations.

Ironically, the ancient x86 string instruction rep movsq performs much better than SSE and AVX for memory copies!

The article here explains how to saturate memory bandwidth really well, and it has rich references to explore further as well.

See also Enhanced REP MOVSB for memcpy here on SO, where @BeeOnRope's answer discusses NT stores (and non-RFO stores done by rep stosb/stosq) vs. regular stores, and how single-core memory bandwidth is often limited by max concurrency / latency, not by the memory controller itself.
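
For large buffers that will not be re-read soon, non-temporal (streaming) stores are the usual way to get closer to DRAM bandwidth from a single core, because they bypass the cache and skip the read-for-ownership traffic of normal stores. A minimal sketch, assuming both pointers are 32-byte aligned and n is a multiple of 8 floats (the function name is my own illustration):

    #include <immintrin.h>
    #include <cstddef>

    void copy_stream_avx(float* dst, const float* src, std::size_t n)
    {
        for (std::size_t i = 0; i < n; i += 8)
        {
            __m256 v = _mm256_load_ps(src + i); // aligned 32-byte load
            _mm256_stream_ps(dst + i, v);       // non-temporal store, bypasses cache
        }
        _mm_sfence(); // order the NT stores before any later stores
    }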

PhD AP EcE
  • rep movsq doesn't perform better, especially on small blocks and on modern CPUs (released after 2014), because rep movs has initial startup costs of about 35 cycles, and during those 35 cycles you can do 35 loads and 35 stores of 32 bytes using AVX. – Maxim Masiutin May 09 '17 at 21:39
  • Thanks Max! It is good to know that CPUs after 2014 are considered modern :). – PhD AP EcE Jun 14 '17 at 21:00
3

Writing fast SSE is not as simple as dropping SSE operations in place of their non-parallel equivalents. In this case I suspect your compiler cannot usefully unroll the load/store pair, and your time is dominated by stalls caused by using the output of one long-latency operation (the load) in the very next instruction (the store).

You can test this idea by manually unrolling one notch:

    //SSE-copy testing
    start2 = std::chrono::system_clock::now();
    for(int i=0; i<nn; ++i)
    {
        auto _mas = mas;
        auto _tar = tar;
        for(; _mas!=mas+sz; _mas+=8, _tar+=8)
        {
            __m128 buffer1 = _mm_load_ps(_mas);
            __m128 buffer2 = _mm_load_ps(_mas+4);
            _mm_store_ps(_tar, buffer1);
            _mm_store_ps(_tar+4, buffer2);
        }
    }

Normally when using intrinsics I disassemble the output and make sure nothing crazy is going on (you could try this to verify if/how the original loop got unrolled). For more complex loops the right tool to use is the Intel Architecture Code Analyzer (IACA). It's a static analysis tool which can tell you things like "you have pipeline stalls".

Ben Jackson
  • This is not the answer. The OP does not ask why his SSE/AVX code differs in performance from std::copy; he asks why the performance characteristics change when `nn` changes. – Stefan Jun 05 '14 at 19:50
  • This should help some, but hardware memory reordering already allows it to delay stores. Unless there's 4k aliasing between a store and the *next* load, there shouldn't be a problem. (Assuming both buffers have the same alignment relative to a 4k page, the memory disambiguation hardware can tell that stores don't overlap with later loads just by looking at the page-offset bits.) – Peter Cordes Oct 19 '17 at 21:16
3

I think this is because the measurement is not accurate for such short operations.

When measuring performance on an Intel CPU:

  1. Disable "Turbo Boost" and "SpeedStep". You can do this in the system BIOS.

  2. Change the process/thread priority to High or Realtime. This will keep your thread running.

  3. Set the process CPU affinity mask to a single core. CPU masking together with higher priority will minimize context switching.

  4. Use the __rdtsc() intrinsic. Intel Core-series CPUs return the internal time-stamp counter with __rdtsc(); you will get 3,400,000,000 counts/second from a 3.4 GHz CPU. Note that __rdtsc() by itself is not a serializing instruction, so out-of-order execution can blur very short measurements; see the sketch after this list.
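
A small sketch of serialized TSC timing, following the usual CPUID+RDTSC / RDTSCP+CPUID pattern (the helper names are mine; MSVC intrinsics and a CPU with RDTSCP support are assumed):

    #include <intrin.h>

    unsigned long long tsc_begin()
    {
        int regs[4];
        __cpuid(regs, 0); // serialize: earlier instructions must retire first
        return __rdtsc();
    }

    unsigned long long tsc_end()
    {
        unsigned int aux;
        unsigned long long t = __rdtscp(&aux); // waits for prior instructions to finish
        int regs[4];
        __cpuid(regs, 0); // keep later instructions from starting before the read
        return t;
    }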

This is my test-bed startup code for testing SSE/AVX code.

    #include <windows.h> // GetProcessAffinityMask, SetPriorityClass, DWORD_PTR
    #include <tchar.h>   // _tmain, _TCHAR
    #include <intrin.h>  // __rdtsc
    #include <cstdio>    // printf

    // Returns the 1-based index of the highest set bit, or 0 if no bit is set.
    int GetMSB(DWORD_PTR dwordPtr)
    {
        if(dwordPtr)
        {
            int result = 1;
    #if defined(_WIN64)
            if(dwordPtr & 0xFFFFFFFF00000000) { result += 32; dwordPtr &= 0xFFFFFFFF00000000; }
            if(dwordPtr & 0xFFFF0000FFFF0000) { result += 16; dwordPtr &= 0xFFFF0000FFFF0000; }
            if(dwordPtr & 0xFF00FF00FF00FF00) { result += 8;  dwordPtr &= 0xFF00FF00FF00FF00; }
            if(dwordPtr & 0xF0F0F0F0F0F0F0F0) { result += 4;  dwordPtr &= 0xF0F0F0F0F0F0F0F0; }
            if(dwordPtr & 0xCCCCCCCCCCCCCCCC) { result += 2;  dwordPtr &= 0xCCCCCCCCCCCCCCCC; }
            if(dwordPtr & 0xAAAAAAAAAAAAAAAA) { result += 1; }
    #else
            if(dwordPtr & 0xFFFF0000) { result += 16; dwordPtr &= 0xFFFF0000; }
            if(dwordPtr & 0xFF00FF00) { result += 8;  dwordPtr &= 0xFF00FF00; }
            if(dwordPtr & 0xF0F0F0F0) { result += 4;  dwordPtr &= 0xF0F0F0F0; }
            if(dwordPtr & 0xCCCCCCCC) { result += 2;  dwordPtr &= 0xCCCCCCCC; }
            if(dwordPtr & 0xAAAAAAAA) { result += 1; }
    #endif
            return result;
        }
        else
        {
            return 0;
        }
    }

    int _tmain(int argc, _TCHAR* argv[])
    {
        // Set Core Affinity
        DWORD_PTR processMask, systemMask;
        GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask);
        // Shift as DWORD_PTR so the mask is computed in 64 bits on Win64.
        SetProcessAffinityMask(GetCurrentProcess(), (DWORD_PTR)1 << (GetMSB(processMask) - 1) );
    
        // Set Process Priority. you can use REALTIME_PRIORITY_CLASS.
        SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    
        DWORD64 start, end;
        start = __rdtsc();
    // your code here.
        end = __rdtsc();
        printf("%I64d\n", end - start);
        return 0;
    }
zupet
  • Be warned: rdtsc() returns the number of clock cycles since some point in time, counted at the base clock speed. If your CPU has Turbo Boost or power-saving features, this will not return what you'd expect. Consider using ThrottleStop to lock your CPU at its base frequency when running such benchmarks. – Stefan Jun 04 '14 at 21:23
  • @Stefan, what is ThrottleStop? That sounds like something I want to employ. – Z boson Jun 06 '14 at 10:54
  • ThrottleStop is a simple, no-nonsense tool that allows you to control the clock speeds of your CPU; AFAIK all CPUs since the Core 2 Duo are supported, even the mobile ones. http://www.thedigitalhq.com/downloads/download-info/throttlestop-6-00/. Generally, you want to use this whenever running benchmarks, to eliminate as many variables as possible. It only works on Windows. – Stefan Jun 06 '14 at 17:32
-2

I think that your main problem/bottleneck is your _mm_malloc.

I highly suggest using std::vector as your main data structure if you are concerned about locality in C++.

Intrinsics are not exactly a "library"; they are more like built-in functions provided by your compiler, and you should be familiar with your compiler's internals/docs before using these functions.

Also note that the fact that AVX is newer than SSE doesn't make AVX faster; whatever you are planning to use, the number of cycles taken by a function is probably more important than the "AVX vs SSE" argument. For example, see this answer.

Try with a POD int array[] or a std::vector.
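
If you do go the std::vector route, note that the aligned-load intrinsics still require 16/32-byte alignment, so the vector needs an aligned allocator. A minimal sketch (the allocator below is my own illustration, not a standard facility):

    #include <immintrin.h>
    #include <cstddef>
    #include <new>
    #include <vector>

    // Allocator that hands std::vector 32-byte-aligned storage via _mm_malloc.
    template <typename T, std::size_t Align = 32>
    struct aligned_allocator
    {
        using value_type = T;
        template <typename U> struct rebind { using other = aligned_allocator<U, Align>; };

        aligned_allocator() = default;
        template <typename U>
        aligned_allocator(const aligned_allocator<U, Align>&) {}

        T* allocate(std::size_t n)
        {
            if (void* p = _mm_malloc(n * sizeof(T), Align))
                return static_cast<T*>(p);
            throw std::bad_alloc();
        }
        void deallocate(T* p, std::size_t) { _mm_free(p); }
    };

    template <typename T, typename U, std::size_t A>
    bool operator==(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return true; }
    template <typename T, typename U, std::size_t A>
    bool operator!=(const aligned_allocator<T, A>&, const aligned_allocator<U, A>&) { return false; }

    // usage: std::vector<float, aligned_allocator<float, 32>> v(1024);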

user2485710
  • You recommend `std::vector`, a data structure which gives no control over alignment, for a test case which uses instructions that _require_ correct alignment? Additionally, your `_mm_malloc` source specifically concerns the auto-vectorizer. If `_mm_malloc` did _not_ work as expected, `_mm_load_ps` would generate a fault. – Stefan Jun 05 '14 at 19:57
  • @Stefan I'm suggesting it because of the cache, not because of the alignment; plus, I can't think of a container that will provide you with the correct alignment auto-magically, so it's likely that you have to work with your `T` to get the appropriate alignment that you need. Also, my answer never mentions this stuff; it is clearly oriented toward memory, cache and allocations. I can't see how your comment relates to my answer. – user2485710 Jun 05 '14 at 21:21
  • I really don't see how using `std::vector` over `_mm_malloc` helps with the cache or locality, let alone how it can be a 'bottleneck' in this test case. `_mm_malloc` is simply a wrapper around an aligned `malloc`. – Stefan Jun 05 '14 at 21:37
  • @Stefan Seriously, read my post before commenting: the `std::vector` part is about locality, and this is clearly expressed in my post. – user2485710 Jun 05 '14 at 21:38
  • Your post, in my opinion, clearly expresses that his main problem is `_mm_malloc` and that you recommend `std::vector` if you are concerned about locality. What am I missing here? – Stefan Jun 05 '14 at 21:57
  • `vsqrtps` and `vdivps` are the only AVX instructions that don't have full-width execution units in Haswell. (On Sandy/IvyBridge, 32B loads/stores are also split into 16B chunks), but the way you phrase it with that link to a question about SQRT implies that most AVX instructions are slower than their SSE equivalents. That's not the case on Intel CPUs: most are equal throughput (but for loads/stores that requires they hit in L1D cache... so memory bandwidth is really a separate issue from ALU throughput.) – Peter Cordes Oct 19 '17 at 21:21