
I've been using a GPU for a while without questioning it, but now I'm curious.

Why can a GPU do matrix multiplication so much faster than a CPU? Is it because of parallel processing? But I didn't write any parallel processing code. Does the GPU do it automatically by itself?

Any intuition / high-level explanation will be appreciated!

  • Yes, because of massively parallel computation. You might not have written any parallel code, but the TF or Torch built-in modules are optimized to run on the GPU (parallelized). – Umang Gupta Jul 15 '18 at 00:01
  • I really don't understand the people who downvoted or wanted this question to be closed. It's an important question for some people to ask. – aerin Jul 15 '18 at 20:55
  • @Aaron It will stay closed, because the answer explains it, and follow-up questions about CUDA programming would be more appropriate for SO. It's not that it's a "bad" question; "too broad" means that one could write a book about it. – Martin Zeitler Oct 09 '18 at 20:05

3 Answers


How do you parallelize the computations?

GPUs are able to do a lot of parallel computations, far more than a CPU can. Look at this example: vector addition of, say, 1M elements.

On a CPU, let's say you can run a maximum of 100 threads (100 is a lot more than a typical CPU offers, but let's assume it for a while):

In a typical multi-threading setup, let's say you parallelize the additions across all threads.

Here is what I mean:

c[0] = a[0] + b[0]       # let's do it on thread 0
c[1] = a[1] + b[1]       # let's do it on thread 1
...
c[99] = a[99] + b[99]    # let's do it on thread 99
c[100] = a[100] + b[100] # thread 0 again, in the next batch
c[101] = a[101] + b[101] # thread 1 again, in the next batch

We are able to do this because the value of c[0] doesn't depend on any values other than a[0] and b[0], so each addition is independent of the others. Hence, the task is easy to parallelize.

As you can see in the example above, the additions of 100 different elements take place simultaneously, saving you time. This way it takes 1M / 100 = 10,000 steps to add all the elements.
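
As a rough sketch of that chunking idea (illustrative only; it assumes NumPy and the standard library's thread pool, and the worker count and chunk size are just the hypothetical numbers from above):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

n = 1_000_000
num_workers = 100                 # the hypothetical 100 CPU threads
a = np.random.rand(n)
b = np.random.rand(n)
c = np.empty(n)

def add_chunk(start, stop):
    # each worker performs n / num_workers independent additions
    c[start:stop] = a[start:stop] + b[start:stop]

chunk = n // num_workers          # 10,000 elements per worker
with ThreadPoolExecutor(max_workers=num_workers) as pool:
    for start in range(0, n, chunk):
        pool.submit(add_chunk, start, min(start + chunk, n))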


How efficiently does a GPU parallelize?

Now consider a modern GPU with about 2048 threads: all of those threads can independently perform 2048 different operations at the same time. Hence the speed-up; the same vector addition would take only 1M / 2048 ≈ 489 steps.

In your case of matrix multiplication, you can parallelize the computations because a GPU has many more threads, organized into blocks of a grid. So a lot of the computations are parallelized, resulting in fast computation.
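
For a concrete picture (a sketch, not definitive; it assumes PyTorch is installed), the very same matmul call runs on the GPU once the tensors are moved there:

import torch

a = torch.randn(1024, 1024)
b = torch.randn(1024, 1024)

c_cpu = a @ b  # matrix multiplication on the CPU

if torch.cuda.is_available():
    # the same operation, now dispatched to massively parallel CUDA kernels
    c_gpu = a.cuda() @ b.cuda()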


But I didn't write any parallel processing code for my GTX 1080! Does it do it by itself?

Almost every machine-learning framework uses parallelized implementations of all the common operations. This is achieved with CUDA programming, NVIDIA's API for doing parallel computations on NVIDIA GPUs. You don't write it explicitly; it's all done at a low level, and you don't even have to know about it.

That said, it doesn't mean that a C++ program you wrote will automatically be parallelized just because you have a GPU. You would need to write it using CUDA; only then will it be parallelized. But most programming frameworks already have this, so it is not required on your end.
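
To make "write it using CUDA" concrete, here is a minimal sketch of an explicit vector-addition kernel, written with Numba's CUDA bindings rather than C++ (an illustrative choice; it assumes Numba and a CUDA-capable GPU). Each GPU thread handles one element, exactly the pattern from the 100-thread example above:

import numpy as np
from numba import cuda

@cuda.jit
def vec_add(a, b, c):
    i = cuda.grid(1)        # this thread's global index
    if i < c.size:          # guard threads past the end of the array
        c[i] = a[i] + b[i]  # one independent addition per thread

n = 1_000_000
a = np.random.rand(n).astype(np.float32)
b = np.random.rand(n).astype(np.float32)
c = np.zeros_like(a)

threads_per_block = 256
blocks = (n + threads_per_block - 1) // threads_per_block  # ceil division
vec_add[blocks, threads_per_block](a, b, c)  # launches ~1M GPU threads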

coder3101
  • I don't think your thread analogy is correct. Computations are CPU processor bound, not thread bound. Thus on a 4 core CPU with 2048 threads, you can only do 4 parallel mathematical operations _in parallel_. This goes up a bit with SIMD. However, a GPU is comprised of many many smaller processors, which means it can highly parallelise the computations. – Rambatino Feb 11 '21 at 16:54
  • x86 processors have 2 threads per core, so a 4-core processor has 8 threads, and if utilised efficiently they can all run in parallel. The above analogy of 100 CPU threads is realistic: on a 64-core processor you can in fact run 128 parallel threads. You can create as many threads as you want, say 2048, on a CPU as well, but only 128 of them (on 64 cores) will run in parallel; the rest will be executed concurrently. So I think it is bound not by processor count but by the number of threads a processor can run in parallel. – coder3101 Feb 13 '21 at 02:40
  • For instance, the Apple M1 has 1 thread per core, so an 8-core M1 can run only 8 threads. Clearly computation is bound not by cores but by the total number of threads a processor can run in parallel. For simplicity, ignore SIMD instructions. – coder3101 Feb 13 '21 at 02:45

Actually, this question led me to take the Computer Architecture class at UW (with Dr. Luis Ceze). Now I can answer it.

To sum it up, it's because of hardware specialization. A chip's architecture can be tailored to trade flexibility for efficiency (more flexible vs. more efficient). For example, a GPU is highly specialized for parallel processing, while a CPU is designed to handle many different kinds of operations.

[Image: chip architectures on a flexibility-vs-efficiency spectrum, from CPU to GPU]

In addition, FPGAs and ASICs are even more specialized than GPUs. (Do you see the blocks for processing units?)

[Image: more specialized designs (FPGA, ASIC) with dedicated blocks for processing units]

aerin
  • In my understanding, FPGAs are more flexible than CPUs or GPUs (you can literally re-program the gates and memory to perform any hardware function in the field), yet less efficient: the re-programmability resources (extra muxes and wiring) consume extra chip area and also affect speed, as signals must pass through more gates / routing. E.g. an FPGA can be reprogrammed to include CPUs, GPUs (and sometimes it will even **reprogram itself**). Lovely infographics, but IMO FPGAs should be on the far left of the last image. – Ralph Jul 15 '22 at 01:33

GPU design traditionally focuses on maximizing the number of floating-point units and on multidimensional array operations. GPUs were originally designed for graphics, where linear algebra is central.

CPUs are optimized for general computing and single-threaded execution. Each execution unit is large and sophisticated.

Brent