
I had this idea for something "intrinsic-like" in OpenGL, but googling around brought no results.

So basically I have a Compute Shader for calculating the Mandelbrot set (each thread does one pixel). Part of my main function in GLSL looks like this:

float XR, XI, XR2, XI2, CR, CI;
uint i;
// Map this invocation's pixel coordinates to a point C in the complex plane.
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0.0;
XI = 0.0;
// Iterate Z = Z^2 + C until the orbit escapes (|Z|^2 > 4) or MaxIter is reached.
for (i = 0; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2.0 * XR * XI + CI;
    XR = XR2 - XI2 + CR;
    if ((XR * XR + XI * XI) > 4.0)
    {
        break;
    }
}

So my thought was to use vec4s instead of floats and thereby do 4 calculations/pixels at once, hopefully getting a 4x speed boost (analogous to "real" CPU intrinsics). But my code seems to run MUCH slower than the float version. There are still some mistakes in there (if anyone would still like to see the full code, please say so), but I don't think they are what slows down the code. Before I try around for ages: can anybody tell me right away whether this endeavour is futile?
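
Roughly, the vec4 variant looks like this - a simplified sketch rather than my exact code (the names lane/active/iter are made up for this example; the mistakes I mentioned are in the real version, which is not shown here):

// Each invocation handles four horizontally adjacent pixels at once.
vec4 lane = vec4(0.0, 1.0, 2.0, 3.0);
vec4 CR = vec4(minX + (4.0 * gl_GlobalInvocationID.x + lane) * (maxX - minX) / ResX);
vec4 CI = vec4(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
vec4 XR = vec4(0.0);
vec4 XI = vec4(0.0);
vec4 active = vec4(1.0);   // 1.0 while a lane is still iterating, 0.0 once it has escaped
vec4 iter = vec4(0.0);     // per-lane iteration count
for (uint i = 0u; i < MaxIter; i++)
{
    vec4 XR2 = XR * XR;
    vec4 XI2 = XI * XI;
    XI = 2.0 * XR * XI + CI;
    XR = XR2 - XI2 + CR;
    // A single lane cannot 'break' on its own, so escaped lanes are masked out instead.
    active = min(active, step(XR * XR + XI * XI, vec4(4.0)));
    iter += active;
    if (active == vec4(0.0)) break;   // only stop once all four lanes have escaped
}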

  • Well, how do you implement it? Neighboring pixels won't necessarily exit the loop at the same time. So how does that work? – Nicol Bolas Dec 31 '22 at 20:44
  • AFAIK modern GPUs use a super-scalar architecture for their individual cores. That means that using a vec4 will not speed anything up. – Yakov Galka Dec 31 '22 at 20:47
  • I did this same thing once with CPU intrinsics and it actually sped up the code by about 4 times. Yes, neighboring pixels won't necessarily exit the loop at the same time, but that didn't slow down the code, so I would expect this to work on the GPU as well. At least Chat-GPT says that vectorizing the code MIGHT speed things up ( :-b ). So @YakovGalka: are you sure vectorizing will never speed anything up? – Paul Aner Dec 31 '22 at 21:04

1 Answer


CPUs and GPUs work quite differently.

CPUs need explicit vectorization in the machine code, either coded manually by the programmer (through what you call 'CPU-intrinsics') or vectorized automatically by the compiler.

GPUs, on the other hand, vectorize by running multiple invocations of your shader (aka kernel) on their cores in parallel.

AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of building a single core that can add 4 floats per clock (for example), it's more beneficial to have four times as many simpler cores, each able to add a single float per clock. This way you still get the same peak FLOPS for the entire chip, while enabling full utilization of the circuitry even when the individual shader code cannot be vectorized. The thing is that most code, by necessity, will have at least some scalar computations in it.
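
To illustrate: a plain scalar compute shader like yours is already the "vectorized" form as far as the hardware is concerned. A minimal skeleton might look like this (the workgroup size, the image binding and the output shading are just assumptions for the example; the uniforms match the names in your question):

#version 430
// 16x16 = 256 invocations per workgroup; the driver packs them into
// hardware SIMD waves (e.g. 32 or 64 invocations wide) automatically.
layout(local_size_x = 16, local_size_y = 16) in;

layout(rgba8, binding = 0) uniform writeonly image2D Dest;   // assumed output image
uniform float minX, maxX, minY, maxY, ResX, ResY;            // as in the question
uniform uint  MaxIter;

void main()
{
    // One pixel per invocation, written with purely scalar math:
    // the cross-pixel parallelism comes from the invocations themselves.
    float CR = minX + float(gl_GlobalInvocationID.x) * (maxX - minX) / ResX;
    float CI = minY + float(gl_GlobalInvocationID.y) * (maxY - minY) / ResY;
    float XR = 0.0, XI = 0.0;
    uint  i;
    for (i = 0u; i < MaxIter; i++)
    {
        float XR2 = XR * XR;
        float XI2 = XI * XI;
        XI = 2.0 * XR * XI + CI;
        XR = XR2 - XI2 + CR;
        if (XR * XR + XI * XI > 4.0) break;
    }
    // Map the escape iteration to a grey value (simple shading, just for the example).
    float shade = float(i) / float(MaxIter);
    imageStore(Dest, ivec2(gl_GlobalInvocationID.xy), vec4(vec3(shade), 1.0));
}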

The bottom line is: it's likely that your code already squeezes as much out of the GPU as possible for this specific task.

  • OK, thank you for your answer. I will have to look if I can optimize some other way. Can you give me a general hint (does this make sense?) on how to "buffer" an SSBO input array (this would be for a compute shader using perturbation theory, so there is a reference-point array that needs to be read every iteration)? I tried, in some other code, to read something like this into a shared array with the first thread, using a memory barrier - see the sketch after this comment thread for the kind of staging I mean. On CUDA I did something similar once and it worked great. On OpenGL the code worked, but was EXACTLY as fast as the "non-buffered" approach. – Paul Aner Jan 01 '23 at 04:43
  • @PaulAner I'm not sure I understand your question; but since this question is not about SSBOs, I recommend that you open a new one and post an example of what you mean. – Yakov Galka Jan 01 '23 at 05:01
  • @Paul IIRC, some GPUs have SIMD within a 32-bit or 64-bit chunk (the width of a GPU "lane"), like packed 8-bit add. But more normally, intrinsics for GPUs are just scalar operations that plain C doesn't have operators for, e.g. fast approximate reciprocal square root (`rsqrt` https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#intrinsic-functions), or integer operations like SAD, byte-reverse (32 or 64-bit) https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__INT.html or stuff the HW can do faster than plain multiply like `__mul24` (32-bit product of low-24) – Peter Cordes Jan 01 '23 at 07:34
  • @PaulAner: Ok yeah, found a page about CUDA SIMD intrinsics, for multiple narrow integer elements within an `unsigned int`: https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html#group__CUDA__MATH__INTRINSIC__SIMD like `__vavgs4` which does `(a+b)>>1` separately within each of the 4 bytes of a uint32_t. I have no idea if those operations are explicitly accessible via OpenGL, or if it's just up to the OpenGL shader compiler to use those hardware operations for you when your shader is operating on 8-bit data, i.e. "auto vectorizing". – Peter Cordes Jan 01 '23 at 07:37
  • @PeterCordes Re CUDA SIMD: though it's interesting that such SIMD operations are provided, they aren't very useful for doing single-precision arithmetic :). Thank you for the information though! Re rsqrt...__mul24: I think the OP uses 'CPU-intrinsics' in a very narrow sense, referring to SSE SIMD as exposed to C++ programs through intrinsics. – Yakov Galka Jan 01 '23 at 07:44
  • @YakovGalka: Agreed about what the OP was asking, and your answer is correct that GPU SIMD is very different from manually vectorizing for modern CPUs with SSE/AVX/NEON style fixed-width short-vector SIMD using 128-bit to 512-bit where one instruction operates on 4 to 16 floats in parallel, reducing front-end bandwidth per FLOP. GPUs just have many simple instruction pipelines in parallel so they don't need to worry about shuffling within SIMD vectors like CPUs do. My comments were always intended as a footnote / fun fact addition, not a correction. – Peter Cordes Jan 01 '23 at 07:51
  • @PaulAner I haven't really seen anyone performing vectorization inside a shader. Shader invocations are already vectorized. I also don't know of any hardware support for things similar to SSE/AVX. About ChatGPT: that's the problem with chat bots. They get a lot of stuff right, and then they produce a lot of stuff that just sounds right but is nonsense. Since the bot doesn't tell you where it got the information from or why it thinks it is correct, it's impossible to judge the correctness. – BDL Jan 01 '23 at 11:16
  • @BDL That is of course correct. The thing is: 90% of what Chat-GPT says DOES work / IS correct (in 50% of cases it can even explain a joke to you). And I wouldn't consider it a chat bot. It IS the most sophisticated language-based A.I. that "creates" information/code that did not exist before. I don't want to advertise it too much (I am not affiliated with OpenAI in any way) ;-), but in a lot of cases it is actually a REALLY nice and helpful assistant. You can let it create some easier (but complicated/bothersome) code or even ask very specific programming questions you can't really google... – Paul Aner Jan 01 '23 at 11:27
  • @PaulAner you do sound like some covert advertisement for Chat-GPT though. The thing is, if you're here to learn, there's no substitute for using your own brains. If instead you want us to explain what Chat-GPT "had in mind", then I have to say this: how about you go and ask that "most sophisticated artificial intelligence" what it meant, and have it explain it to you with code and examples. – Yakov Galka Jan 01 '23 at 18:30
  • @YakovGalka - Oh, come on! No advertisement intended. Still, GPT is a really nice tool. Of course you can't just take its code as is and should think about it. And by the way: this https://stackoverflow.com/questions/39490845/glsl-scalar-vs-vector-performance says vectorization CAN be - in some cases - faster. I did not want you guys to explain code to me - I understand it perfectly. My question was: CAN/SHOULD vectorization be faster on GPUs? – Paul Aner Jan 01 '23 at 19:11
  • @BDL it's quite possible to judge the correctness of ChatGPT's output... if you understand the underlying subject matter well (in which case you had no need to consult with ChatGPT in the first place :) ) – Jeremy Friesner Jan 01 '23 at 19:41
  • @PaulAner: 99% of what ChatGPT says *about computer architecture* is (subtly) wrong, though. For details of an example, see [my comments on meta](https://meta.stackoverflow.com/questions/422066/why-was-my-answer-deleted-for-using-chatgpt-even-when-i-didnt/422067#comment938857_422067) - it makes false statements to support its conclusions. You don't have 10k rep to see deleted answers, but [C# and SIMD: High and low speedups. What is happening?](https://stackoverflow.com/q/56951793) has been getting regular ChatGPT dumps, ranging from generic/useless to specific enough to be wrong. – Peter Cordes Jan 02 '23 at 01:35
  • @PaulAner: ChatGPT doesn't "understand" what it's saying, so probably tends to string words together like SIMD intrinsics being beneficial, not realizing that the context is different between OpenGL shaders vs. programming for a CPU. So maybe not 99% of what it says is wrong, but all the SO answers I've seen in [assembly] / [cpu-architecture] / [simd] that were dumps of ChatGPT output have had at least one serious mis-statement, often one which could seriously mislead someone who didn't already know better. In some cases as the basis for the whole answer. – Peter Cordes Jan 02 '23 at 01:37
  • @PaulAner: e.g. one of the things ChatGPT said on that C# SIMD Q&A was *if the data is too large, the SIMD code may not be able to fully utilize the available SIMD registers, resulting in reduced performance.* That's nonsense; big arrays make unrolling with more registers *more* worthwhile. It could be true of SIMD *execution units*, because the bottleneck is memory bandwidth with huge data that won't be hot in cache, but that's not what ChatGPT said. Or maybe it meant that if each element is too large, like `uint128`, you can't use SIMD at all? But that's different from "can't fully utilize". – Peter Cordes Jan 02 '23 at 01:44
  • @PeterCordes I really didn't mean to start a ChatGPT discussion here. And yes - even in my experience there is in most cases some subtle thing wrong in what it says or in the code it produces. Nonetheless - I found it very helpful a lot of the time, even if just as "a better Google" or as a starting point for some (complicated) code. Apparently I shouldn't put any stock in it saying vectorization CAN give a speed boost on the GPU (and I wish I hadn't mentioned it). So I think we should stop the GPU discussion here... – Paul Aner Jan 02 '23 at 09:56
  • @PaulAner: Interesting, using its output to find some technical terms to search on sounds useful. I just wanted to make it clear that any specific statement from ChatGPT (like vectorization speedups on GPUs) should be taken with a huge grain of salt; it's totally normal for ChatGPT to just invent wrong ideas when it comes to conceptual explanations or specifics / details. And it turns out this is one of those cases. I'm not saying it's impossible to derive value from ChatGPT, or crapping on people for using it. – Peter Cordes Jan 02 '23 at 10:22
  • why does anyone think ChatGPT is a reliable source of anything??? – user253751 Jan 02 '23 at 21:01
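
For reference, a rough sketch of the shared-memory staging mentioned in the comments above (a hypothetical example only: the buffer layout, the names RefOrbit/refZ/refShared/RefLen/CHUNK and the chunk size are all made up, and here every invocation copies one element of a chunk rather than only the first thread doing the whole copy):

#version 430
layout(local_size_x = 256) in;

// Hypothetical reference-orbit buffer for a perturbation-theory shader.
layout(std430, binding = 1) readonly buffer RefOrbit { vec2 refZ[]; };
uniform uint RefLen;                      // number of reference-orbit entries (assumed)

const uint CHUNK = 256u;                  // must match local_size_x
shared vec2 refShared[CHUNK];

void main()
{
    // ... per-pixel setup as in the question ...
    for (uint base = 0u; base < RefLen; base += CHUNK)
    {
        // Each invocation copies one element of the current chunk (bounds checks omitted).
        refShared[gl_LocalInvocationIndex] = refZ[base + gl_LocalInvocationIndex];
        memoryBarrierShared();
        barrier();                        // chunk is now visible to the whole workgroup

        for (uint i = 0u; i < CHUNK; i++)
        {
            vec2 Z = refShared[i];
            // ... perturbation iteration using Z goes here ...
        }

        barrier();                        // don't overwrite the chunk while others still read it
    }
}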