3

I have a question about the peak FP performance of my Core i7 920. I have an application that does a lot of MAC operations (basically a convolution), and even with multi-threading and SSE instructions I fall short of the CPU's peak FP performance by a factor of ~8x. While trying to find the reason for this, I ended up with a simplified code snippet, running on a single thread and without SSE instructions, which performs equally badly:

for(i=0; i<49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}

If I'm correct (data, other_data and other_data2 are all arrays of single-precision floats), this piece of code requires:

49335264 * 2 = 98670528 FLOPs

It executes in ~150 ms (I'm quite sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).

This means the performance of this code snippet is:

98670528 / (150 * 10^-3) / 10^9 ≈ 0.66 GFLOP/s

whereas the peak performance of this CPU should be 2 * 3.2 = 6.4 GFLOP/s (2 FP units at 3.2 GHz), right?

Is there any explanation for this huge gap? Because I cannot explain it.

Thanks a lot in advance, and I could really use your help!

Ricky
  • I know, but each core has 2 FPUs, right? I can look into the assembly code to see how the operations are translated into instructions, thanks for the hint. I'll come back on that – Ricky Feb 29 '12 at 14:19
  • Related: http://stackoverflow.com/q/8389648/922184 Basically, it takes a lot of work to get up to peak performance. – Mysticial Feb 29 '12 at 19:07
  • Also, modern FPUs don't even take the same amount of time for each instruction. Some inputs will make the same instruction take longer than others, and FLOPs are fundamentally a bogus unit. – zmccord Feb 29 '12 at 22:06
  • And even then there's the loop overhead . . . – geometrian Sep 21 '15 at 00:41

4 Answers

5

I would use SSE.

Edit: I ran some more tests myself and discovered that your program is neither limited by memory bandwidth (the theoretical limit is about 3-4 times higher than your result) nor by floating-point performance (which has an even higher limit); it is limited by lazy allocation of memory pages by the OS.

#include <chrono>
#include <iostream>
#include <x86intrin.h>

using namespace std::chrono;

static const unsigned size = 49335264;

float data[size], other_data[size], other_data2[size];

int main() {
#if 0
    // Touch the arrays so the OS actually maps the pages to physical memory.
    for (unsigned i = 0; i < size; i++) {
        data[i] = i;
        other_data[i] = i;
        other_data2[i] = i;
    }
#endif
    system_clock::time_point start = system_clock::now();
    for (unsigned i = 0; i < size; i++)
        data[i] += other_data[i] * other_data2[i];

    microseconds timeUsed = duration_cast<microseconds>(system_clock::now() - start);

    std::cout << "Used " << timeUsed.count() << " us, "
              << 2*size/(timeUsed.count()/1e6*1e9) << " GFLOPS\n";
}

Compile with g++ -O3 -march=native -std=c++0x. The program gives

Used 212027 us, 0.465368 GFLOPS

as output, although the hot loop translates to

400848:       vmovaps 0xc234100(%rdx),%ymm0
400850:       vmulps 0x601180(%rdx),%ymm0,%ymm0
400858:       vaddps 0x17e67080(%rdx),%ymm0,%ymm0
400860:       vmovaps %ymm0,0x17e67080(%rdx)
400868:       add    $0x20,%rdx
40086c:       cmp    $0xbc32f80,%rdx
400873:       jne    400848 <main+0x18>

This means it is fully vectorized, processing 8 floats per iteration and even taking advantage of AVX. After playing around with streaming instructions like movntdq, which didn't buy anything, I decided to actually initialize the arrays with something - otherwise they remain zero pages, which only get mapped to real memory when they are written to. Changing the #if 0 to #if 1 immediately yields

Used 48843 us, 2.02016 GFLOPS

which comes pretty close to the memory bandwidth of the system (4 floats of 4 bytes each per two FLOPs = 16 bytes per 2 FLOPs, so roughly 16 GByte/s at 2 GFLOP/s; the theoretical limit is 2 channels of DDR3 at 10.667 GByte/s each).
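For reference, the streaming-store experiment mentioned above could look roughly like this with intrinsics. This is only a sketch, not the exact code I used; `mac_stream` is a made-up name, and it assumes the arrays are 32-byte aligned and that `size` is a multiple of 8:

#include <immintrin.h>

// Multiply-accumulate with a non-temporal store (vmovntps, the float analogue
// of movntdq) so the written data bypasses the cache.
void mac_stream(float *data, const float *a, const float *b, unsigned size) {
    for (unsigned i = 0; i < size; i += 8) {
        __m256 prod = _mm256_mul_ps(_mm256_load_ps(a + i), _mm256_load_ps(b + i));
        __m256 sum  = _mm256_add_ps(_mm256_load_ps(data + i), prod);
        _mm256_stream_ps(data + i, sum);   // write around the cache
    }
    _mm_sfence();   // make sure the streaming stores are globally visible
}

As noted above, on my system this didn't change the measured performance, since the loop is already running at the memory bandwidth limit.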

Gunther Piez
  • Thanks for your reaction. I know I can use SSE instructions etc. to improve the performance. Actually, I already have. But I want to know why I cannot reach the processor's peak performance, and to find that out I have to know why (without MT and SSE) I cannot reach the potential of the processor – Ricky Mar 01 '12 at 12:53
  • @DanLeakin Updated with some more results. – Gunther Piez Mar 01 '12 at 16:10
  • Thanks for your time, but I don't understand your last point about the memory saturation: I agree I need 16 bytes / 2 FLOPs, but when dividing the total number of bytes by the execution time from your example I get ~8 GB/s for the required bandwidth, not 16 GB/s. By the way, this code gets to 1.3 GFLOPS on my CPU. – Ricky Mar 02 '12 at 10:04
  • You access (3 reads + 1 write) * (4 bytes per float) * 49335264, which is roughly 800 MB, in about 0.05 sec, which is 16 GByte/s. – Gunther Piez Mar 02 '12 at 10:25
  • Possibly you hit the memory bandwidth limit on your system (as I wrote earlier), but I cannot rule out a cache problem - reading into cache, updating and evicting when the cache gets full might be slower than reading and writing directly to memory. You can load and store the data with `movntdqa` and `movntdq` instructions to avoid cache pollution. I tried it with the streaming instructions on my system, but it didn't change the performance, and the result is consistent with the theoretical limit on my system. – Gunther Piez Mar 02 '12 at 10:30
  • Sorry, I miscalculated ;) Do you maybe have an example of some similar code that is able to reach the peak FP performance of one core? – Ricky Mar 02 '12 at 10:54
  • If I just shrink the loop bound so that the working set fits into the caches, I immediately get 20 GFlops. The limit for my system would be 8 (AVX) * 4.5 GHz = 36 GFlops per core. With some clever unrolling and interleaving you can get pretty close, as shown in http://stackoverflow.com/q/8389648 (see the sketch below). – Gunther Piez Mar 02 '12 at 11:32
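A rough sketch of the unroll-and-interleave idea mentioned in the last comment (my own illustration, not code from the linked question): keep several independent multiply and add chains in flight on register-only data, so the FP units never stall waiting for a previous result. The constants and the number of chains here are arbitrary; in practice you need enough independent chains to cover the full instruction latencies.

#include <immintrin.h>

// Register-only kernel: 4 independent multiply chains and 4 independent add
// chains, interleaved. Each loop iteration issues 8 AVX instructions
// (8 * 8 = 64 FLOPs) with no memory traffic at all.
float peak_kernel(unsigned iters) {
    __m256 m0 = _mm256_set1_ps(1.0f), m1 = _mm256_set1_ps(1.1f),
           m2 = _mm256_set1_ps(1.2f), m3 = _mm256_set1_ps(1.3f);
    __m256 a0 = _mm256_set1_ps(0.1f), a1 = _mm256_set1_ps(0.2f),
           a2 = _mm256_set1_ps(0.3f), a3 = _mm256_set1_ps(0.4f);
    const __m256 mul = _mm256_set1_ps(1.0000001f);
    const __m256 add = _mm256_set1_ps(1e-7f);
    for (unsigned i = 0; i < iters; i++) {
        m0 = _mm256_mul_ps(m0, mul);  a0 = _mm256_add_ps(a0, add);
        m1 = _mm256_mul_ps(m1, mul);  a1 = _mm256_add_ps(a1, add);
        m2 = _mm256_mul_ps(m2, mul);  a2 = _mm256_add_ps(a2, add);
        m3 = _mm256_mul_ps(m3, mul);  a3 = _mm256_add_ps(a3, add);
    }
    // Combine everything so the compiler cannot optimize the loop away.
    __m256 s = _mm256_add_ps(_mm256_add_ps(m0, m1), _mm256_add_ps(m2, m3));
    s = _mm256_add_ps(s, _mm256_add_ps(_mm256_add_ps(a0, a1), _mm256_add_ps(a2, a3)));
    float out[8];
    _mm256_storeu_ps(out, s);
    return out[0] + out[1] + out[2] + out[3] + out[4] + out[5] + out[6] + out[7];
}

Timing many iterations of such a kernel and dividing 64 * iters by the elapsed time gives a number far closer to the theoretical peak than any loop that has to stream its operands from memory.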
3

The explanation is simple: while your processor can in principle sustain (say) 6.4 GFLOP/s, your memory sub-system can only feed data in and out at roughly a tenth of that rate (a broad rule of thumb for most current commodity CPUs). So achieving a sustained FLOP rate of 1/8th of the theoretical maximum for your processor is actually very good performance.

Since you seem to be dealing with almost 600 MB of data (three arrays of ~49 million 4-byte floats), which is far larger than the caches on your processor, your computation is memory bound.
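To put rough numbers on it: each iteration does 2 FLOPs but touches 16 bytes (3 reads + 1 write of 4-byte floats), so sustaining the 6.4 GFLOP/s peak from the question would require about 6.4 / 2 * 16 ≈ 51 GB/s of memory traffic - well beyond what a DDR3 memory system of that era can deliver.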

High Performance Mark
  • Moreover, floating-point multiplication is not necessarily a single-cycle operation. – swegi Feb 29 '12 at 14:06
  • Thanks for your answer. The peak stream bandwidth of my CPU is ~16.4 GB/s, right? So let's say every iteration I require 3 FP reads and 1 FP write, or 16 bytes of bandwidth. This would require 789,364,224 bytes of traffic for the entire application, which runs in ~150 ms. This would mean I use 789,364,224 / (150 * 10^-3) / 10^9 ≈ 5.26 GB/s. So I would say I don't hit this bandwidth ceiling? – Ricky Feb 29 '12 at 14:09
  • You'll never exactly hit the theoretical bound for memory throughput on your platform (for a number of reasons), but one thing that could affect your particular situation is your memory configuration. The Core i7-920 has a three-channel memory interface; if you don't have memory installed in all three channels (i.e. at least three separate modules), then your peak performance will scale down accordingly. – Jason R Feb 29 '12 at 14:23
  • I tried changing the operation within the loop to "data[i] += 2.0 * 5.0" and this gives the same results, which shouldn't suffer from the problem you describe, right? – Ricky Feb 29 '12 at 14:30
  • Your modified test should put less of a load on the memory subsystem, yes. I didn't see you say it anywhere explicitly, so I'll ask: are you using `float` or `double` types? – Jason R Feb 29 '12 at 14:53
  • I ran the STREAM benchmark on my system and it reported a bandwidth of ~9 GB/s, so this indeed cannot be the problem. – Ricky Mar 01 '12 at 13:38
1

As High Performance Mark explained, your test is very likely to be memory bound rather than compute-bound.

One thing I'd like to add is that to quantify this effect, you can modify the test so that it operates on data that fits into the L1 cache:

for (i = 0, j = 0; i < 6166908; i++)   // 6166908 = 49335264 / 8, loop unrolled 8x
{
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;
    data[j] += other_data[j] * other_data2[j]; j++;

    // Wrap j so the working set stays at 3 * 1024 floats = 12 KB, which fits in L1.
    if ((j & 1023) == 0) j = 0;
}

The performance of this version of the code should be much closer to the theoretical FLOPS maximum. Of course, it presumably doesn't solve your original problem, but hopefully it helps you understand what's going on.

Igor Ostrovsky
  • Running this code gives me ~0.75 GFlops, so it is indeed a little higher than my previous performance but still considerably low. Could it be that because of the data structure I use, including pointers, the memory behavior and thus the performance is so bad? – Ricky Mar 01 '12 at 10:26
0

I looked at the assembly code for the multiply-accumulate of the code snippet in my first post, and it looks like this:

movq  0x80(%rbx), %rcx
movq  0x138(%rbx), %rdi
movq  0x120(%rbx), %rdx
movq  (%rcx), %rsi
movq  0x8(%rdi), %r8
movq  0x8(%rdx), %r9
movssl  0x1400(%rsi), %xmm0
mulssl  0x90(%r8), %xmm0
addssl  0x9f8(%r9), %xmm0
movssl  %xmm0, 0x9f8(%r9)

I estimated from the total number of cycles that it takes ~10 cycles to execute the multiply-accumulate.

The problem seems to be that the compiler is unable to pipeline the execution of the loop, even though there are no inter-loop dependencies - am I correct?

Does anybody have any other ideas / solutions for this?

Thanks for the help so far!

Ricky
  • The Intel compiler, the latest version – Ricky Mar 01 '12 at 12:28
  • Even if perfectly scheduled, the code is limited by load/store operations in this case. The theoretical maximum for this processor is 1 load + 1 store per clock if in L1 cache, so you need at least 2 clocks to get the operands. So the theoretical maximum would be 1 FLOP per cycle. But as long as the code is as poorly scheduled as in your snippet, additional latencies add up - the `mulss` instruction has a latency of 4 clocks, so a rate of a quarter of the theoretical maximum seems entirely plausible. – Gunther Piez Mar 01 '12 at 13:40
  • I understand your comment, but isn't it possible for the compiler to pipeline this since there are no inter-loop dependencies? Or is it just not possible to fetch that much data that fast, even though the other operations could be pipelined? – Ricky Mar 01 '12 at 14:07
  • Can the compiler prove that `data`, `other_data` and `other_data2` cannot alias each other? If so, it should clearly reorder the instructions in a more efficient way. So either you are using too many pointers somewhere and the compiler has to assume possible aliasing, or the compiler just can't optimize it (see the sketch after these comments for one way to promise the compiler there is no aliasing). – Gunther Piez Mar 01 '12 at 14:38
  • I defined static arrays for data, other_data and other_data2 as float data[size] and operated on those arrays, and this gives me a performance of 1.3 GFlops. So the use of pointers and the data structure does have an influence, but it doesn't close the entire gap to the peak performance – Ricky Mar 01 '12 at 16:03
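To illustrate the aliasing point from the comments above (a sketch only; the function name and signature are made up for this example): marking the pointers with `__restrict` (the GCC/ICC/MSVC spelling; C99 uses plain `restrict`) promises the compiler that the three buffers never overlap, so it is free to reorder, vectorize and software-pipeline the multiply-accumulate.

// Hypothetical helper: __restrict tells the compiler the three arrays
// cannot alias each other, removing the need for conservative scheduling.
void mac(float *__restrict data,
         const float *__restrict other_data,
         const float *__restrict other_data2,
         unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        data[i] += other_data[i] * other_data2[i];
}

Global static arrays, as in the last comment, achieve the same effect, because the compiler then knows the storage is distinct.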