15

What is the difference between a single processing unit of a CPU and a single processing unit of a GPU?
Most resources I've come across on the internet cover the high-level differences between the two. I want to know what instructions each can perform, how fast they are, and how these processing units are integrated into the complete architecture.
It seems like a question with a long answer, so lots of links are fine.

edit:
In the CPU, the FPU runs real-number operations. How fast are the same operations in each GPU core? If they are fast, why is that?
I know my question is very generic, but my goal is to have questions like this answered.

alphadog
  • 10,473
  • 2
  • 13
  • 19
  • This question is REALLY off topic for stack overflow... – Lilith Daemon Apr 17 '16 at 20:37
  • Where should I be? – alphadog Apr 17 '16 at 20:38
  • Honestly, not a clue. But this has nothing to do with programming. I'm not even sure if it's a good idea for me to answer it, so as not to reward off-topic questions. – Lilith Daemon Apr 17 '16 at 20:39
  • I'll repost somewhere else. *sad* – alphadog Apr 17 '16 at 20:40
  • Sorry to be the bearer of bad news. I'll add an answer at least as far as it is relevant to coding, but the architectural differences are far too specific and off topic. – Lilith Daemon Apr 17 '16 at 20:41
  • That'll be a new perspective to my question. *happy* – alphadog Apr 17 '16 at 20:44
  • 1
    Well, if you want to know what instructions "a" GPU can perform, you can take a look at Intel's GPU documentation. They have a very detailed ISA reference for their GPUs. However, GPU architectures are very diverse; VLIW, SIMD, and scalar machines have all been used, and that is only Intel's implementation. – user3528438 Apr 17 '16 at 20:46
  • By `single processing unit` I am assuming that you mean a single CPU/GPU core? – Lilith Daemon Apr 17 '16 at 20:49
  • Yes. I want to know the architectural differences. – alphadog Apr 17 '16 at 20:50
  • It's a sad thing that high-level languages hide this stuff. But I think it will answer my question if I understand how the programming model of GPUs works at each core and in their interconnection array. – alphadog Apr 17 '16 at 20:52
  • 2
    The architectural differences are highly dependent on the specific GPUs/CPUs. (They differ extremely even within the same class (one CPU vs. another CPU), let alone components designed for completely different purposes.) They really are `apples` and `oranges`: both fruit, but COMPLETELY different in design and purpose. – Lilith Daemon Apr 17 '16 at 20:53
  • In the CPU, the FPU runs real-number operations. How fast are the same operations in each GPU core? If they are fast, why is that? – alphadog Apr 17 '16 at 20:57
  • 1
    The interesting differences between CPUs and GPUs are at a higher level than FP multiply hardware. A single FP multiplier logic block in a CPU wouldn't be very different from the same in a GPU, AFAIK. It's in the logic that handles a stream of instructions with branches where you see the real differences. GPUs (AFAIK based on no experience programming them) aren't built to handle parallel algorithms with early-out conditions (like high-quality video encoding, e.g. x264). Note that GPU video-encoding is done on fixed-function hardware, *not* on the normal GPU execution units. – Peter Cordes Apr 18 '16 at 00:53
  • 2
    In my opinion this question is not off topic for SO. It might be borderline, but there is value in understanding the interaction between hardware and software in order to best map a problem to the appropriate architecture. While the question might be too broad, it should not be impossible to give a concise answer that explains the **main difference** between GPU and CPU. Certainly explaining every difference would be too broad. I think the following question related to caches has a similar scope to this one: http://stackoverflow.com/questions/944966/cache-memories-in-multicore-cpus. – Gabriel Southern Apr 18 '16 at 02:44
  • On top of all of this, CPUs and GPUs do not differ only in their architecture and purpose, but also in the way they can perform some mathematical operations. An advantage of GPUs can be, for example, the fused multiply-add (FMA), which is faster and closer to the real value when you perform a multiply and an add. On Nvidia CUDA GPUs: http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma Conversely, when CPUs have the x87 extension, they handle your 64-bit floating-point numbers in 80-bit registers, handling more overflow and giving better accuracy. – Taro Apr 23 '16 at 23:43
  • Of course there is more than one reason for the speed. I later read the Tesla architecture white paper; I think everyone should read it. – alphadog Apr 24 '16 at 03:35
  • @Taro, are you seriously talking about x87? AMD and Intel have offered FMA since 2011 and 2013 respectively. Each Haswell core can process two 256-bit wide FMA operations per cycle. – Z boson Apr 24 '16 at 14:31
  • @Zboson but x87 is still supported if you use it. It is just an example of the fact that CPUs and GPUs also have differences beyond the hardware design. – Taro Apr 25 '16 at 07:33
  • @Taro x87 is still supported because x86 is backwards compatible but nobody recommends using x87 anymore. Backwards compatibility is a major difference between x86 and GPUs though. – Z boson Apr 25 '16 at 18:40
  • @Zboson as I said, it was just an example. It could have been any other instruction set. Again, it was just an example for new-kid to know that the hardware design is not the only difference between CPUs and GPUs. – Taro Apr 26 '16 at 08:00

3 Answers

13

Short answer

The main difference between GPUs and CPUs is that GPUs are designed to execute the same operation in parallel on many independent data elements, while CPUs are designed to execute a single stream of instructions as quickly as possible.

Detailed answer

Part of the question asks

In the CPU, the FPU runs real-number operations. How fast are the same operations in each GPU core? If they are fast, why is that?

This refers to the floating point (FP) execution units that are used in CPUs and GPUs. The main difference is not how a single FP execution unit is implemented. Rather, the difference is that a CPU core will only have a few FP execution units that operate on independent instructions, while a GPU will have hundreds of them that operate on independent data in parallel.

GPUs were originally developed to perform computations for graphics applications, and in these applications the same operation is performed repeatedly on millions of different data points (imagine applying an operation that looks at each pixel on your screen). By using SIMD or SIMT operations the GPU reduces the overhead of processing a single instruction, at the cost of requiring multiple instructions to operate in lock-step.

Later, GPGPU programming became popular because there are many types of programming problems besides graphics that are suited to this model. The main characteristic is that the problem is data-parallel, namely that the same operations can be performed independently on many separate data elements.
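
To make the data-parallel model concrete, here is a minimal CUDA sketch of my own (not from the original question): every GPU thread applies the same multiply-add to one independent element, which is exactly the workload shape described above.

```cuda
#include <cuda_runtime.h>

// SAXPY-style kernel: the same multiply-add applied to many independent
// elements, one element per GPU thread (the data-parallel / SIMT model).
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));       // unified memory, for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Many threads each perform the same operation on their own element.
    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    // The serial CPU equivalent is a single instruction stream:
    // for (int i = 0; i < n; ++i) y[i] = 3.0f * x[i] + y[i];

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```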

In contrast to GPUs, CPUs are optimized to execute a single stream of instructions as quickly as possible. CPUs use pipelining, caching, branch prediction, out-of-order execution, etc. to achieve this goal. Most of the transistors and energy spent executing a single floating point instruction are spent in the overhead of managing that instruction's flow through the pipeline, rather than in the FP execution unit. While a GPU's and a CPU's FP units will likely differ somewhat, this is not the main difference between the two architectures. The main difference is in how the instruction stream is handled. CPUs also tend to have cache-coherent memory between separate cores, while GPUs do not.

There are of course many variations in how specific CPUs and GPUs are implemented. But the high-level programming difference is that GPUs are optimized for data-parallel workloads, while CPU cores are optimized for executing a single stream of instructions as quickly as possible.

Gabriel Southern
  • 9,602
  • 12
  • 56
  • 95
9

Your question may open up various answers and architecture design considerations. To focus strictly on your question, you need to define more precisely what a "single processing unit" means.

On NVIDIA GPUs, work is arranged in warps, which are not separable: a group of CUDA "cores" all execute the same instruction on some data, possibly with some of them not executing that instruction (the warp size is 32 entries). This notion of a warp is very similar to the SIMD instructions of CPUs with SSE (2 or 4 entries) or AVX (4 or 8 entries) capability. AVX operations also operate on a group of values, and the different "lanes" of this vector unit cannot perform different operations at the same time.

CUDA is called SIMT because there is a bit more flexibility in CUDA "threads" than in AVX "lanes", but it is conceptually similar. In essence, a predicate indicates whether an operation should be performed on a given CUDA "core", and AVX offers masked operations on its lanes to provide similar behavior. Reading from and writing to memory is also different: GPUs implement both gather and scatter, whereas only AVX2 processors have gather, and scatter is only scheduled for AVX-512.
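
As a rough sketch of what that flexibility looks like in source code (an illustrative kernel of my own, with hypothetical names), the fragment below contains a per-thread branch, which the hardware handles by predication/serialization within a warp, and a gather-style indexed load.

```cuda
// Illustrative kernel: each thread reads from its own computed address (a
// gather) and takes a data-dependent branch. Threads of a warp that disagree
// on the branch are predicated/serialized, not given separate instruction
// streams.
__global__ void gather_abs_scale(int n, const int *idx, const float *x, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[idx[i]];      // gather: per-thread indexed load
        if (v < 0.0f)             // divergent branch, handled with predication
            v = -v;
        out[i] = 2.0f * v;        // the indexed form out[idx[i]] = ... would be a scatter
    }
}
```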

Considering a "single processing unit" with this analogy would mean a single CUDA "core", or a single AVX "lane" for example. In that case, the two are VERY similar. In practice both operate add, sub, mul, fma in a single cycle (throughput, latency may vary a lot though), in a manner compliant with IEEE norm, in 32bits or 64bits precision. Note that the number of double-precision CUDA "cores" will vary from gamer devices (a.k.a. GeForce) to Tesla solutions. Also, the frequency of each FPU type differs: discrete GPUs navigate in the 1GHz range where CPUs are more in the 2.x-3.xGHz range.

Finally, GPUs have a special function unit capable of computing coarse approximations of some transcendental functions from the standard math library. These functions, some of which are also implemented in AVX, LRBNi, and AVX-512, perform much better than their precise counterparts. The IEEE standard is not strict for most of these functions, hence allowing different implementations, but this is more of a compiler/linker topic.
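
For example (a minimal sketch; exact error bounds are documented in the CUDA programming guide), CUDA device code offers both the accurate math-library routines and fast intrinsics that map to those special function units.

```cuda
// sinf()/expf() are the accurate math-library versions; __sinf()/__expf()
// map to the special function units and trade accuracy for speed.
// Compiling with nvcc -use_fast_math substitutes the fast versions globally.
__global__ void sfu_demo(int n, const float *x, float *precise, float *fast) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        precise[i] = sinf(x[i]) + expf(x[i]);      // precise library routines
        fast[i]    = __sinf(x[i]) + __expf(x[i]);  // coarse SFU approximations
    }
}
```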

Florent DUGUET
  • 2,786
  • 16
  • 28
  • 1
    This is the best answer by far. But how are CUDA "threads" a bit more flexible than AVX lanes? To me SIMT is just a name for an API/software trick where you use masking to hold and wait for each operation in a lane to finish. That's been possible for many years with SSE using e.g. `movmsk`. I may have to ask a question about this. – Z boson Apr 24 '16 at 14:26
  • 1
    Very good point. The "bit more flexible" refers mainly to gather and scatter (interaction with memory). Gather has been around on GPUs since the early days of CUDA, as has scatter. Gather is only available in AVX2, and scatter only in the forthcoming AVX-512. Hence the difference tends to fade out. – Florent DUGUET Apr 24 '16 at 14:37
  • 1
    I was aware of gather/scatter but have not really learned to appreciate it since gather sucks on x86. Well, maybe it's okay on Skylake, but on Broadwell and especially Haswell it is no good. I had not really thought about it until now, but maybe one reason is that GPUs run at a lower frequency, which means it's easier to overcome the memory bandwidth, which maybe means it's easier to implement an effective gather/scatter than with high-frequency CPUs. – Z boson Apr 25 '16 at 06:33
  • 1
    The GPU SMs have load and store units (dedicated hardware, memory fetch buffer, etc.), which are dedicated to gather and scatter operations (gather is a very nice legacy of texturing in graphics). Gather on the CPU is very convenient if you do not know at compile time (or it's too hard to know) that your data is aligned. When data is aligned, gather is not very expensive and, coding-wise, it's very comfortable. @Zboson, thanks for your interest in this topic, and for liking the analogy. – Florent DUGUET Apr 25 '16 at 06:41
1

In essence, the major difference as far as writing code to run serially goes is the clock speed of the cores. GPUs often have hundreds of fairly slow cores (modern GPUs often have cores with speeds of 200-400 MHz). This makes them very bad at highly serial applications, but allows them to perform highly granular and concurrent applications (such as rendering) with a great deal of efficiency.

A CPU, however, is designed to perform highly serial applications with little or no multi-threading. Modern CPUs often have 2-8 cores, with clock speeds of 3-4 GHz or more.

Oftentimes, highly optimized systems will take advantage of both resources, using GPUs for highly concurrent tasks and CPUs for highly serial tasks.
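
A minimal sketch of that split (a hypothetical kernel and workload of my own): kernel launches in CUDA are asynchronous with respect to the host, so the CPU can keep running serial work until the synchronization point.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: offload the data-parallel part to the GPU and let the
// CPU continue with serial work until the results are actually needed.
__global__ void scale(int n, float *d, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

void process(float *d_data, int n, long *serial_result) {
    scale<<<(n + 255) / 256, 256>>>(n, d_data, 2.0f);  // asynchronous launch

    long acc = 0;
    for (int i = 0; i < 100000; ++i)   // stand-in for branchy, serial CPU work
        acc += i % 7;
    *serial_result = acc;

    cudaDeviceSynchronize();           // wait for the GPU before using d_data
}
```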

There are several other differences, such as the actual instruction sets, cache handling, etc., but those are out of scope for this question. (And even more off topic for SO.)

Lilith Daemon
  • 1,473
  • 1
  • 19
  • 37
  • Where can I get more insight into this? I'm thinking about posting somewhere else. I already did at the theoretical comp sci community on Stack Exchange. – alphadog Apr 17 '16 at 21:03
  • In all honesty, I don't know where to send you. The Stack Exchange Network is sadly not the right place for every question. Especially for something as broad as this. – Lilith Daemon Apr 17 '16 at 21:15
  • I think I'll have to do hours of work myself going through big, coarse manuals. I just thought that someone who has worked with both architectures could give me real insight into the crux of the thing. – alphadog Apr 17 '16 at 21:19
  • @new-kid Honestly, with the level of depth that you are looking for, manuals would probably be the best bet for you. – Lilith Daemon Apr 17 '16 at 21:59
  • Hey, thank you for your cooperation. I will still try to get some internet help while checking out more by myself. If I find a very good explanation or internal insight into this from a developer's perspective, I'll post it here. – alphadog Apr 17 '16 at 22:03
  • Most GPUs sold today have much higher clock speeds than 400 MHz. A more typical clock speed for a high-end GPU is 1.2 GHz. So while GPUs do have slower clock speeds than CPUs, this is not the main reason why they execute a single instruction stream less quickly. There are other architectural differences that are the main reason, but explaining them would be too much to put in a comment. But I did want to emphasize that while clock speeds do differ, this is not the main reason why GPUs are slower than CPUs for serial code. – Gabriel Southern Apr 18 '16 at 02:51
  • Agreed with Gabriel. @Chris Britt, could you help me find a GPU clocked at 200 MHz? My laptop GPU reads 0.9 GHz. Also, today's CPU trend is to reduce frequency and increase core count; performance per core thus drops (see http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2016/03/31/measuring-performance-of-intel-broadwell-processors-with-high-performance-computing-benchmarks ). – Florent DUGUET Apr 23 '16 at 15:55