
I'm using a JS library called GPU.js. Used like so:

const gpu = new GPU();
const multiplyMatrix = gpu.createKernel(function(a, b) {
    let sum = 0;
    for (let i = 0; i < 512; i++) {
        sum += a[this.thread.y][i] * b[i][this.thread.x];
    }
    return sum;
}).setOutput([512, 512]);
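For context, the resulting kernel is then called like a normal function; here's a minimal sketch with placeholder 512 x 512 inputs (the fill values are arbitrary, just so the expected result is easy to check):

const a = Array.from({ length: 512 }, () => new Array(512).fill(1));
const b = Array.from({ length: 512 }, () => new Array(512).fill(1));

const c = multiplyMatrix(a, b); // 512 x 512 result computed on the GPU
console.log(c[0][0]);           // 512, since every product term is 1 * 1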

But since I work with the GPU not through a low-level API like CUDA or OpenGL, but through a few layers of abstraction (GPU.js on top of WebGL), I never really had to learn the lower-level fundamentals of how the matrix operations actually get assembled on the hardware.

But I notice that with GPU.js, each GPU has a limit on how large a matrix I can operate on, and that limit usually seems tied to the maximum screen resolution the GPU supports. So if I had to guess, I'd say the maximum number of matrix operations I can execute in parallel at one time on a GPU is width x height x 3 color channels, e.g. 7680 x 4320 x 3 on an RTX 3080:

[RTX 3080 spec sheet screenshot listing a maximum digital resolution of 7680 x 4320]

So I'd guess my limit on that card would be:

.setOutput([7680, 4320, 3]);
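For a rough sense of scale, that guess works out to just under 100 million elements per pass:

// 7680 x 4320 pixels x 3 color channels
console.log(7680 * 4320 * 3); // 99532800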

Edit:

This can't be right, since the max resolution spec on every generation of Nvidia GPUs (the 1000, 2000, and 3000 series) has stayed constant, and the clock speed has stayed nearly the same as well; it's the CUDA core count that's increased, and that seems like what would increase the max number of concurrent matrix ops the card is capable of per second, based on the number of threads per core (ref 7m52s). But even looking at the docs I'm not sure how to figure out what that number is, or whether it's even that simple.

How can I figure out the maximum matrix operation size that the GPU can handle in one parallel pass?


1 Answer


It seems that

gl.getParameter(gl.MAX_TEXTURE_SIZE)

may be the correct answer, but I'm still not sure how to work that number out from a card's documentation. It seems like it would be CUDA core count * thread count per core based on the architecture (7m52s).
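If it helps anyone else, here's a minimal sketch of querying that limit directly in the browser (assuming a WebGL2-capable browser; the values in the comments are just typical examples):

// Largest texture dimension the WebGL implementation allows; this caps the
// 2D output size a texture-based GPGPU library like GPU.js can use.
const canvas = document.createElement('canvas');
const gl = canvas.getContext('webgl2') || canvas.getContext('webgl');
if (gl) {
    console.log('MAX_TEXTURE_SIZE:', gl.getParameter(gl.MAX_TEXTURE_SIZE));   // e.g. 16384
    console.log('MAX_VIEWPORT_DIMS:', gl.getParameter(gl.MAX_VIEWPORT_DIMS)); // e.g. Int32Array [16384, 16384]
}

On typical modern desktop cards this reports something like 16384, a driver/API limit that isn't derived from the display-resolution spec on the box.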

  • The maximum display resolution reported on the spec sheet of a GPU does not necessarily match the maximum texture resolution, which is what seems to be used to output kernel data. The library seems to expose a "maxTexSize" member *somewhere* (the doc seems broken); for WebGL code you'd use `gl.getParameter(gl.MAX_TEXTURE_SIZE)` to query the respective information. – LJᛃ Sep 13 '20 at 23:37
  • @LJᛃ I thought something was weird. What doesn't make sense to me about the whole thing is that the max resolution spec on every gen of Nvidia GPU (1000, 2000, 3000 series) has been constant, and the clock speed has stayed nearly the same as well; it's the CUDA core count that's increased, and it would seem that would be increasing the max number of concurrent matrix ops the card is capable of per second, not the max supported resolution spec of the card... – J.Todd Sep 14 '20 at 00:02
  • @LJᛃ But what I'm still not clear on is what specs for the card I can look at to know what its max concurrent matrix ops are. `gl.MAX_TEXTURE_SIZE` may tell us the correct answer, but what about just by looking at the card specs? Seems like it would be CUDA core count * number of threads per core, but I'm not sure where to find the number of threads per CUDA core for each card architecture. Edit: The answer may be here somewhere: https://cuda.readthedocs.io/ko/latest/rtx2080Ti/ – J.Todd Sep 14 '20 at 00:03
  • GPU.js will run your code using WebGL, which in turn does not have native support for any kind of "compute" features; it exposes two features that can be *abused* to do compute-like workloads (fragment shading and transform feedback), both of which are deeply nested in a configurable _rendering_ pipeline with a lot more going on. There is no way to derive the available concurrency in this scenario from any spec sheet or WebGL value. Even utilizing something like CUDA, in a desktop environment you won't get all the resources of the card because the OS and other processes still need their share. – LJᛃ Sep 14 '20 at 09:39
  • @LJᛃ Are you saying that since GPU.js is built on the rendering pipeline, it isn't capable of leveraging the full calculation performance of the GPU? It seems like if your goal is to run logic on matrix data, the GPU's full power would be brought to bear through the rendering process, and the math/logic we can do in shaders would expose pretty much most of the GPU's computing power. Could you explain why that wouldn't be the case, like maybe documentation I could read on features that would expose more matrix operation power outside the render pipeline? – J.Todd Sep 15 '20 at 11:06
  • Any rendering pipeline is an abstraction on top of the computing hardware a GPU has; any abstraction adds overhead and makes generalized assumptions, leading to inefficiencies, so the closer you get to the hardware the more efficiently you can utilize its resources. In the context of the browser there's no way around using WebGL (though not necessarily GPU.js) to utilize the GPU; outside of that there are quite a few technologies available, most of which are listed in the [GPGPU Wikipedia article](https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units#Implementations). – LJᛃ Sep 15 '20 at 16:09
  • @LJᛃ right. If I wanted optimal I'd probably be best off learning some CUDA :P – J.Todd Sep 15 '20 at 16:10
  • Actually you'd target AMD GPUs and write [GCN assembly](https://www.reddit.com/r/ROCm/comments/akxmct/gcn_inline_assembly/) ;P – LJᛃ Sep 15 '20 at 18:01