33

I have worked a bit with CUDA, and a lot with the CPU, and I'm trying to understand the difference between the two. My i5 processor has 4 cores and cost $200, and my NVIDIA GTX 660 has 960 cores and cost about the same.

I would be really happy if someone could explain the key architectural differences between the two processing units, in terms of their abilities, pros, and cons. For example, does a CUDA core have branch prediction?

Asclepius
OopsUser
  • CPU cores are general purpose, GPU cores are usually very specific. – Matthew Jan 07 '14 at 16:19
  • I understand that, but I want to understand in more depth what makes GPU cores so specific, and what they can't do that CPU cores can. – OopsUser Jan 07 '14 at 16:23
  • How is this a programming question? – talonmies Jan 07 '14 at 16:29
  • 1
    @talonmies When you're developing software that has the potential to be multithreaded, it is important to understand the pros and cons of the architectures you can use, so you choose wisely and don't find out after a month of development that although CUDA has many processors, each of them has disadvantages that can be crucial for your performance and unsuitable for your problem. – OopsUser Jan 07 '14 at 16:34
  • OK so it isn't a programming question. Vote to close. – talonmies Jan 07 '14 at 16:42
  • 6
    Although it's not a question about a specific programming problem, it does belong here, as its answer may help programmers (like me) choose the appropriate infrastructure for the software they are writing. – OopsUser Jan 07 '14 at 16:47
  • 3
    @OopsUser I recommend you review [GPU Architecture](http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf) slides 3-25 which will give you an idea how NVIDIA Kepler SMX execute instructions. – Greg Smith Jan 07 '14 at 19:15

4 Answers

11

This is a computer architecture question, which entails a long answer; I will try to keep it simple at the risk of being inaccurate. You basically answered your own question by asking whether a CUDA core has branch prediction: the answer is no. A CPU core has to handle every kind of operation a computer does: calculation, memory fetching, I/O, interrupts. It therefore has a huge, complex instruction set, and branch prediction is used to optimize the speed of instruction fetching.
It also has a big cache and a fast clock rate. Implementing that instruction set requires more logic, and thus more transistors and more cost per core compared to a GPU core.

GPU cores have less cache memory, simpler instructions, and lower clock rates, but they are optimized to do more calculation as a group. The simpler instruction set and smaller cache make them less expensive per core.
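To see what "no branch prediction" means in practice: GPU threads in a warp execute in lockstep, so when a branch diverges, the hardware runs both sides of the branch in sequence, masking off the inactive threads. A minimal Python sketch of this SIMT behavior (warp size and function names are illustrative, not hardware-accurate):

```python
# Toy model of SIMT branch execution within one warp: there is no branch
# prediction; the warp runs BOTH sides of a divergent branch, masking off
# the inactive threads on each pass.

WARP_SIZE = 8  # illustrative; real NVIDIA warps have 32 threads

def simt_branch(values):
    predicate = [v % 2 == 0 for v in values]   # per-thread branch condition
    results = [0] * len(values)
    # Pass 1: only threads with a true predicate execute the "if" path.
    for i, active in enumerate(predicate):
        if active:
            results[i] = values[i] * 2
    # Pass 2: the remaining threads execute the "else" path.
    for i, active in enumerate(predicate):
        if not active:
            results[i] = values[i] + 1
    return results  # total cost = both paths, unlike a predicted CPU branch

print(simt_branch(list(range(WARP_SIZE))))  # [0, 2, 4, 4, 8, 6, 12, 8]
```

A CPU with a branch predictor would instead speculate down one path and pay only on a misprediction; the warp always pays for both paths when its threads diverge.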

Archmede
Nadim Farhat
  • Every single sentence of your answer is incorrect. The int32 and fp32 cores have exactly the same amount of cache. And CUDA cores do NOT fetch memory; the multiprocessor has load/store units for that. There is no concept of interrupts on a GPU. And int32 and fp32 cores operate in lockstep; they literally operate on the same clock. – Johan Aug 23 '23 at 09:49
6

CUDA cores are more like lanes of a vector unit, gathered into warps. In essence, CUDA cores are entries in a wider AVX, VSX, or NEON vector.

The closest thing to a CPU core is an SMX. It can handle multiple contexts (warps, like hyper-threading/SMT), and it has several parallel execution pipelines (6 FP32 for Kepler, versus 2 on Haswell and 2 on POWER8). And each SMX is independent, just like any core of a general-purpose CPU.

This analogy is detailed further here: https://stackoverflow.com/a/36812922/6218300.
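The lane analogy can be sketched in code: a CUDA kernel body is written from the point of view of a single lane, and the hardware applies that one instruction stream across every lane of the warp, much as one AVX instruction applies across all elements of a vector register. A rough Python sketch (the warp size, `saxpy_lane`, and `run_warp` are illustrative names, not real APIs):

```python
# The "CUDA core = vector lane" analogy: a kernel is written for one lane,
# and the hardware applies the same instruction stream across every lane
# of the warp, like one AVX instruction across a whole vector register.

WARP_SIZE = 8  # illustrative; NVIDIA warps have 32 lanes

def saxpy_lane(lane_id, a, x, y):
    """The kernel body, as seen by one lane (one 'CUDA core')."""
    return a * x[lane_id] + y[lane_id]

def run_warp(kernel, a, x, y):
    """The hardware view: the same code runs on all lanes in lockstep."""
    return [kernel(lane, a, x, y) for lane in range(WARP_SIZE)]

x = [1.0] * WARP_SIZE
y = [float(i) for i in range(WARP_SIZE)]
print(run_warp(saxpy_lane, 2.0, x, y))
# [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
```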

Florent DUGUET
4

They are now in principle much the same as CPU cores. It wasn't long ago that this wasn't true: in 2005, for example, they were unable to process integer operands.

When comparing the CPU core complexity of your i5, keep in mind that the original 80386 CPU had just about 275K transistors, while a Core 2 Duo has about 230 million: roughly 1000 times more, so the numbers fit well.

The biggest difference is memory handling, which has become even more complicated than in the good old days when we needed segment registers. There is no virtual memory, etc., and this is a very narrow bottleneck when you try to port normal CPU programs. But the real problem is that non-local memory access is very expensive: 400-800 cycles. GPUs use a technique that, outside the GPU world, only Sun's Niagara T1/T2 general-purpose CPUs had: while one set of threads waits for a memory access, the scheduler switches to a different set of threads (called warps) whose instructions are ready. But if all your threads do is jump non-linearly around your data, your performance simply collapses.
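The latency-hiding idea can be sketched with a toy scheduler: each "warp" stalls for a fixed number of cycles after issuing a load, and the scheduler issues from whichever warp is ready. The cycle counts and function name below are illustrative, not hardware figures:

```python
def total_cycles(n_warps, loads_per_warp, mem_latency):
    """Toy model: the scheduler can issue one load per cycle, and a warp
    that issued a load is stalled for mem_latency cycles afterwards."""
    ready = [0] * n_warps                 # cycle at which each warp may issue again
    remaining = [loads_per_warp] * n_warps
    cycle = 0
    while any(remaining):
        for w in range(n_warps):          # pick the first warp that is ready
            if remaining[w] and ready[w] <= cycle:
                remaining[w] -= 1
                ready[w] = cycle + mem_latency   # this warp stalls on its load
                break
        cycle += 1
    return max(ready)                     # time until the last load completes

# One lone warp waits out every stall; eight warps overlap their stalls.
print(total_cycles(1, 4, 400))   # 1600 cycles for 4 loads
print(total_cycles(8, 4, 400))   # 1607 cycles for 32 loads
```

With a single warp, the 400-cycle latencies add up serially; with eight warps, eight times the work finishes in nearly the same wall-clock time, because the stalls overlap. This is why GPUs want far more resident threads than they have cores.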

nawfal
Lothar
  • 8
    Downvoting without saying what is wrong is typical cowardice on modern Stack Overflow and a reason why this once-great website is dying. The memory access is the problem. – Lothar May 17 '16 at 20:22
3

You need to understand the fundamental differences between CPUs and GPUs and the reasons for the rise of GPGPU computing in recent times. An informative course on this is available on Udacity.

Also, this book might be helpful for beginner-level programmers.

Though this is not a programming question, I hope it might help someone.

Itachi