
I have a presentation to give to people who have (almost) no clue how a GPU works. I think saying that a GPU has a thousand cores where a CPU has only four to eight is nonsense, but I want to give my audience an element of comparison.

After a few months working with NVidia's Kepler and AMD's GCN architectures, I'm tempted to compare a GPU "core" to a CPU's SIMD ALU (I don't know if Intel has a name for those). Is that fair? After all, at the assembly level, the programming models have a lot in common (at least with GCN; take a look at p. 2-6 of the ISA manual).

This article states that a Haswell processor can do 32 single-precision operations per cycle, but I suppose there is pipelining or other things happening to achieve that rate. In NVidia parlance, how many CUDA cores does this processor have? I would say 8 per CPU core for 32-bit operations, but that is just a guess based on the SIMD width.

Of course there are many other things to take into account when comparing CPU and GPU hardware, but that is not what I'm trying to do. I just have to explain how the thing works.

PS: Any pointers to CPU hardware documentation or CPU/GPU presentations are greatly appreciated!

EDIT: Thanks for your answers; sadly I had to choose only one of them. I accepted Igor's answer because it sticks closest to my initial question and gave me enough information to justify why this comparison shouldn't be taken too far, but CaptainObvious provided very good articles.

Simon
  • In your comparison, be sure to distinguish between floating-point-dominated and integer-dominated algorithms. Floating point is usually what gets emphasized, but many people use GPUs (AMD GPUs) for bitcoin mining due to their superior integer support: http://www.tomshardware.com/reviews/bitcoin-mining-make-money,3514.html. – Z boson Jul 03 '13 at 10:42

4 Answers


I'd be very cautious about making this kind of comparison. After all, even in the GPU world the term "core" has very different capabilities depending on the context: the new AMD GCN is quite different from the old VLIW4 one, which itself is quite different from a CUDA core.
Besides that, you will create more confusion than understanding in your audience if you make just one small comparison with the CPU and leave it at that. If I were you, I'd still go for a more detailed (it can still be quick) comparison.
For instance, someone used to CPUs with little knowledge of GPUs might wonder how a GPU can have so many registers even though they are so expensive (in the CPU world). An explanation of that point is given at the end of this post, along with some more GPU vs. CPU comparisons.

This other article gives a nice comparison between these two kinds of processing units by explaining how GPUs work, how they evolved, and what the differences from CPUs are. It addresses topics like data flow and the memory hierarchy, but also what kinds of applications a GPU is useful for. After all, the power a GPU can deliver is accessible (efficiently) only for some types of problems.
And personally, if I had to make a presentation about GPUs and could make only one reference to CPUs, it would be this: presenting the problems a GPU can solve efficiently versus those a CPU can handle better.
As a bonus, even though it's not directly related to your presentation, here is an article that puts GPGPU in perspective, showing that some of the speedups claimed by some people are overrated (this ties in with my last point, by the way :)).

CaptainObvious
  • +1 for pointing out that GPGPU performance is often over-hyped. – Paul R Jul 02 '13 at 14:36
  • +1, thanks! I'm not setting your answer as accepted because I'm hoping for more feedback and maybe some precision on the SSE/AVX hardware. I was definitely going to point out that those crazy speedups are often an indicator of bad CPU optimization. – Simon Jul 02 '13 at 14:57
  • @CaptainObvious, the article by Intel is worth reading but highly misleading. They take 14 cherry-picked kernels, many of which are not so SIMD-friendly, average the performance across them, and claim only a 2.5x speedup with the GPU. It's absurd to take the average! The GPU is not a general-purpose device like the CPU. Like any tool, you need to know where it's most useful. Even though the claims of 50-100x performance boosts with GPUs are also ridiculous, for many SIMD-friendly algorithms the GPU is about an order of magnitude faster. – Z boson Jul 04 '13 at 08:27
  • Also, I don't understand some of the numbers Intel claims. They say they get about 66% of the peak flops on the GTX 280, but their table has 360 Gflops/s for SGEMM, and according to [wikipedia](http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units#GeForce_200_Series) the peak of the GTX 280 is 933.12 Gflops/s. That's only about 39% of the peak. Perhaps they did not use an SGEMM algorithm as optimized for the GPU as they claim. At least on GK110 it's possible to get over 70% of the peak. – Z boson Jul 04 '13 at 13:32
  • @redrum, I don't really agree with two of your statements. First, I don't think it's **highly** misleading. I agree that the average doesn't mean anything, and a range, for instance, would have been more appropriate (especially in the conclusion). However, they detailed all the speedups, analyzing the reasons behind each number. – CaptainObvious Jul 12 '13 at 13:07
  • I don't agree either that *many are not so SIMD friendly*. In section 2 they clearly explain each of the algorithms, stating when some are not SIMD-friendly (or memory-access-friendly, and so on). What interests me more is that for almost every one of these algorithms I have seen articles claiming to speed them up with a GPU, so their choice seems to make sense. – CaptainObvious Jul 12 '13 at 13:13
  • I agree, though, that they clearly screwed up the SGEMM numbers, especially since they state: *Volkov et al. show that GPUs obtain only about 66% of the peak flops even for SGEMM*, whereas Volkov et al. state: *Our matrix-matrix multiply routine (GEMM) runs 60% faster than the vendor implementation in CUBLAS 1.1 and approaches the peak of hardware capabilities.* Regarding the result they obtained (364 Gflops), let's not forget that the paper is a few years old (nvcc has most probably improved a lot) and that they used a library (not something they fine-tuned). – CaptainObvious Jul 12 '13 at 13:22
  • They chose the average because it makes Intel look better. They repeat that average several times so that people will quote it. They might as well have added a prime sieve or even the Fibonacci series into the average to pull it down further. BTW, I gave you +1, and I think the Intel paper is worth reading (which is part of the reason I upvoted you) despite the fact that it's misleading. An updated paper not by AMD, Nvidia, or Intel would be interesting. – Z boson Jul 12 '13 at 14:16
  • Yes you are right they are biased. I obviously didn't express myself properly. Mainly I thought that the adjectives *highly* and *many* were too strong. – CaptainObvious Jul 12 '13 at 14:47
  • Article link is broken, would love to check it out. I also have a feeling GPU hype is more of a marketing campaign in many cases – Kari Nov 05 '18 at 22:46

Very loosely speaking, it is not entirely unreasonable to say that a Haswell core has about 16 CUDA cores, but you definitely don't want to take that comparison too far. You may want to be cautious about making that statement directly in a presentation, but I've found it to be useful to think of a CUDA core as being somewhat related to a scalar FP unit.

It may help if I explain why Haswell can perform 32 single-precision operations per cycle.

  • 8 single-precision operations execute in each AVX/AVX2 instruction. When writing code that will run on a Haswell CPU, you can use AVX and AVX2 instructions which operate on 256-bit vectors. These 256-bit vectors can represent 8 single-precision FP numbers, 8 integers (32-bit) or 4 double-precision FP numbers.

  • 2 AVX/AVX2 instructions can execute in each core per cycle, although there are some restrictions on which instructions can be paired up.

  • A fused multiply add (FMA) instruction technically performs 2 single-precision operations. FMA instructions perform "fused" operations such as A = A * B + C, so there are arguably two operations per scalar operand: a multiplication and an addition.

This article explains the above points in more detail: http://www.realworldtech.com/haswell-cpu/4/

In the total accounting, a Haswell core can perform 8 * 2 * 2 single-precision operations per cycle. Since CUDA cores support FMA operations as well, you cannot count that factor of 2 when comparing CUDA cores to Haswell cores.
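To make that accounting concrete, here is a minimal sketch (my own illustration, assuming a Haswell-class CPU and a compiler invoked with something like gcc -O3 -mavx2 -mfma) of a loop built from 256-bit FMA instructions:

```c
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256 x    = _mm256_set1_ps(1.000001f);
    __m256 y    = _mm256_set1_ps(0.999999f);
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();

    for (long i = 0; i < 100000000L; ++i) {
        /* One 256-bit FMA = 8 multiplies + 8 additions = 16 SP operations.
           Two independent FMAs per iteration target Haswell's two FMA ports:
           8 lanes x 2 ports x 2 ops = 32 SP operations per cycle.
           (A real peak-FLOPS benchmark would need ~10 independent
           accumulators to hide the 5-cycle FMA latency.) */
        acc0 = _mm256_fmadd_ps(x, y, acc0);
        acc1 = _mm256_fmadd_ps(x, y, acc1);
    }

    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(acc0, acc1));
    printf("%f\n", out[0]);  /* keep the accumulators live */
    return 0;
}
```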

A Kepler CUDA core has one single-precision floating-point unit, so it can perform one floating-point operation per cycle: http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf, http://www.realworldtech.com/kepler-brief/

If I were putting together slides on this, I would have one section explaining how many FP operations Haswell can do per cycle: the three points above, plus the fact that you have multiple cores and possibly multiple processors. And I'd have another section explaining how many FP operations a Kepler GPU can do per cycle: 192 per SMX, and you have multiple SMX units on the GPU.
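As a rough sketch of that slide math (the configurations are just examples I picked for illustration: a 4-core Haswell CPU and a full GK110 with 15 SMX units; shipping parts such as the K20 enable fewer SMX):

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical example configurations for the per-cycle comparison. */
    int cpu_cores = 4;                  /* 4-core Haswell */
    int cpu_flops_per_core = 8 /* SP lanes */ * 2 /* FMA ports */ * 2 /* mul + add */;

    int gpu_smx = 15;                   /* full GK110 */
    int gpu_flops_per_smx = 192 /* CUDA cores */ * 2 /* mul + add per FMA */;

    printf("Haswell CPU: %d SP FLOPs/cycle\n", cpu_cores * cpu_flops_per_core);
    printf("Kepler GPU:  %d SP FLOPs/cycle\n", gpu_smx * gpu_flops_per_smx);
    return 0;
}
```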

PS: I may be stating the obvious, but just to avoid confusion: the Haswell architecture also includes an integrated GPU, which has an altogether different architecture from the Haswell CPU.

Igor ostrovsky
  • Can Haswell really perform two FMAs per cycle at a sustained rate? – nat chouf Jul 03 '13 at 18:00
  • @natchouf, yes, http://stackoverflow.com/questions/15933100/how-to-use-fused-multiply-add-fma-instructions-with-sse-avx – Z boson Jul 04 '13 at 11:00
  • Actually, maybe I misunderstood your question. I don't know how well the doubling of the peak FLOPs/s can be achieved on Haswell in practice. I would expect that the MKL already supports it, so that's a good way to test it (i.e. run SGEMM for a large matrix and see what the FLOPs/s is). – Z boson Jul 04 '13 at 11:04

I completely agree with CaptainObvious, especially that presenting the problems a GPU can solve efficiently vs those a CPU can handle better would be a good idea.

One way I like to compare CPUs and GPUs is by the number of operations per second they can perform. But of course, don't compare one CPU core to a whole many-core GPU.

A Sandy Bridge core can perform 2 AVX ops/cycle, that is, crunch 8 double-precision numbers per cycle. Hence, a computer with 16 Sandy Bridge cores clocked at 2.6 GHz has a peak of 333 Gflops.

A K20 compute module (GK110) has a peak of 1170 Gflops, that is, 3.5 times more. This is a fair comparison in my opinion, and it should be emphasized that peak performance is much easier to reach on a CPU (some applications reach 80%-90% of peak) than on a GPU (the best cases I know of are less than 50% of peak).
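Spelled out as arithmetic (the sustained figures below just apply the rough efficiency percentages above; they are ballpark assumptions, not measurements):

```c
#include <stdio.h>

int main(void)
{
    /* Peak double-precision throughput from the figures above. */
    double cpu_peak_gflops = 16 /* cores */ * 8 /* DP FLOPs/cycle */ * 2.6 /* GHz */;  /* ~333 */
    double gpu_peak_gflops = 1170.0;                                                   /* K20 (GK110) */

    /* Rough sustained estimates using the efficiency ranges mentioned above. */
    printf("CPU: peak %.0f Gflops, at ~85%% of peak: %.0f Gflops\n",
           cpu_peak_gflops, 0.85 * cpu_peak_gflops);
    printf("GPU: peak %.0f Gflops, at ~50%% of peak: %.0f Gflops\n",
           gpu_peak_gflops, 0.50 * gpu_peak_gflops);
    return 0;
}
```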

So to summarize, I would not go into architecture details, but rather state some sheer numbers with the perspective that the peak is often far out of reach on GPUs.

nat chouf
  • Getting peak performance on the GPU is not as bad as you claim. See this link showing peak SGEMM performance for Nvidia and AMD; Nvidia gets over 70%: http://www.anandtech.com/show/6774/nvidias-geforce-gtx-titan-part-2-titans-performance-unveiled/3 Those numbers will improve with time as the algorithms improve. – Z boson Jul 03 '13 at 09:57
  • Additionally, the doubling of peak FLOPs/s for Haswell due to FMA3 does not come automatically for most applications. The applications either have to be recompiled with a looser floating-point model or the code has to be changed to use the FMA3 instructions directly. This means many applications are already below 50% of the peak on Haswell. – Z boson Jul 03 '13 at 10:46
  • I speak only for Sandy Bridge here, as I have not had a chance to work with Haswell yet. The doubling here is due to the 2 vector units being able to compute on independent vector registers simultaneously. And the numbers come from my own program :) Thanks for the link, 70% is getting pretty good. – nat chouf Jul 03 '13 at 17:53
  • Which computer has 16 Sandy Bridge cores? Do you mean two 8-core Xeon (4650L) processors? That's going to cost quite a bit (motherboard and two processors). A better metric is FLOPs/s/USD. – Z boson Jul 04 '13 at 08:20

It's fairer to compare the GPU to vectorized CPU units; however, if your audience has zero idea of how GPUs work, it seems fair to assume that they have similarly little knowledge of vectorized SSE instructions.

For audiences like these it's important to point out the high-level differences, like how blocks of "cores" on the GPU share a scheduler and register file.

I would refer to the GTC Kepler architecture overview for a better idea of what the Kepler architecture looks like. It also gives a reasonably graspable comparison between the two if you want to stick to the "GPU core" idea.

maxywb
  • They do have knowledge of vectorized SSE instructions actually. At least at a software level. That's partly why I want to draw a parallel between CPU-SIMD and GPUs. – Simon Jul 03 '13 at 07:07