
I am trying to implement image processing algorithms such as Gaussian filtering and bilateral filtering on the GPU using GLSL.

I am getting confused about which part is "really" executed in parallel. For example, I have a 1280*720 preview as a texture, and I am not quite sure which parts really run 1280*720 times and which do not.

What is the dispatching mechanism for GLSL code?

My Gaussian filtering code looks like this:

#extension GL_OES_EGL_image_external : require
precision mediump float;

varying vec2 vTextureCoord;
uniform samplerExternalOES sTexture;
uniform sampler2D sTextureMask;

void main() {
    float r = texture2D(sTexture, vTextureCoord).r;
    float g = texture2D(sTexture, vTextureCoord).g;
    float b = texture2D(sTexture, vTextureCoord).b;

    // a test sample
    float test = 1.0 * 0.5;

    float width  = 1280.0;
    float height = 720.0;

    vec4 sum;

    // offsets of a 3*3 kernel
    vec2 offset0 = vec2(-1.0, -1.0); vec2 offset1 = vec2(0.0, -1.0); vec2 offset2 = vec2(1.0, -1.0);
    vec2 offset3 = vec2(-1.0,  0.0); vec2 offset4 = vec2(0.0,  0.0); vec2 offset5 = vec2(1.0,  0.0);
    vec2 offset6 = vec2(-1.0,  1.0); vec2 offset7 = vec2(0.0,  1.0); vec2 offset8 = vec2(1.0,  1.0);

    // Gaussian kernel with sigma == 100.0
    float kernelValue0 = 0.999900; float kernelValue1 = 0.999950; float kernelValue2 = 0.999900;
    float kernelValue3 = 0.999950; float kernelValue4 = 1.000000; float kernelValue5 = 0.999950;
    float kernelValue6 = 0.999900; float kernelValue7 = 0.999950; float kernelValue8 = 0.999900;

    vec4 cTemp0; vec4 cTemp1; vec4 cTemp2;
    vec4 cTemp3; vec4 cTemp4; vec4 cTemp5;
    vec4 cTemp6; vec4 cTemp7; vec4 cTemp8;

    // fetch the 3*3 pixel values around the current pixel
    vec2 src_coor_2;
    src_coor_2 = vec2(vTextureCoord[0] + offset0.x / width, vTextureCoord[1] + offset0.y / height);
    cTemp0 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset1.x / width, vTextureCoord[1] + offset1.y / height);
    cTemp1 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset2.x / width, vTextureCoord[1] + offset2.y / height);
    cTemp2 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset3.x / width, vTextureCoord[1] + offset3.y / height);
    cTemp3 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset4.x / width, vTextureCoord[1] + offset4.y / height);
    cTemp4 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset5.x / width, vTextureCoord[1] + offset5.y / height);
    cTemp5 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset6.x / width, vTextureCoord[1] + offset6.y / height);
    cTemp6 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset7.x / width, vTextureCoord[1] + offset7.y / height);
    cTemp7 = texture2D(sTexture, src_coor_2);
    src_coor_2 = vec2(vTextureCoord[0] + offset8.x / width, vTextureCoord[1] + offset8.y / height);
    cTemp8 = texture2D(sTexture, src_coor_2);

    // convolution
    sum = kernelValue0 * cTemp0 + kernelValue1 * cTemp1 + kernelValue2 * cTemp2 +
          kernelValue3 * cTemp3 + kernelValue4 * cTemp4 + kernelValue5 * cTemp5 +
          kernelValue6 * cTemp6 + kernelValue7 * cTemp7 + kernelValue8 * cTemp8;

    float factor = kernelValue0 + kernelValue1 + kernelValue2 + kernelValue3 + kernelValue4 +
                   kernelValue5 + kernelValue6 + kernelValue7 + kernelValue8;

    gl_FragColor = sum / factor;
    //gl_FragColor = texture2D(sTexture, vTextureCoord);
}

This code runs at a lower FPS than the pure preview on my phone (Galaxy Nexus).

But if I change the last part of my code to output the original pixel value directly, like

    //gl_FragColor = sum/factor;
    gl_FragColor = texture2D(sTexture, vTextureCoord);

it runs fast, at the same FPS as the pure preview.

The question is: for the unused test code that I wrote at the beginning, like:

float test = 1.0 * 0.5;

how many times is it executed?

And for other parts, like:

sum = kernelValue0 * cTemp0 + kernelValue1 * cTemp1 + kernelValue2 * cTemp2 +
      kernelValue3 * cTemp3 + kernelValue4 * cTemp4 + kernelValue5 * cTemp5 +
      kernelValue6 * cTemp6 + kernelValue7 * cTemp7 + kernelValue8 * cTemp8;

would these not run 1280*720 times simply because I change

gl_FragColor = sum/factor;

to

gl_FragColor = texture2D(sTexture, vTextureCoord);?

What is the mechanism that decides which code runs 1280*720 times and which code is simply useless when run in parallel across the pixels? Is it done automatically?

What is the architecture and dispatching mechanism of a GLSL program, and how is the data organized and sent to the GPU?

I am also wondering what I should do for more complicated operations like bilateral filtering, with a kernel size of 9*9, which is 9 times more work per pixel than this 3*3 Gaussian kernel.

  • GLSL is not a low-level language; your assignment of r, g, b as three separate texture lookups is nothing but added verbiage. A sensible compiler should do a single fetch and assign the output to those three scalars. You might be surprised to know that texel fetch instructions often come in 1-, 2- and 4-component variants (4 being king); 3-component lookups are smoke and mirrors. As for "which part is really running 1280x720 times?": it all is, but many of those texture lookups refer to neighboring texels. With a small enough kernel size, you will get lucky and hit the cache for many of them. – Andon M. Coleman Sep 17 '13 at 22:48

2 Answers


The entire fragment shader code is executed as a whole for each and every fragment. A fragment approximates either an output pixel (if no antialiasing is done) or, with multisample antialiasing, the samples of the framebuffer. What exactly a fragment is, is not specified in detail by the OpenGL spec, other than that it is the output of the fragment stage, which is then turned into values on the framebuffer bitplanes.

The rasterizer produces a series of framebuffer addresses and values using a two-dimensional description of a point, line segment, or polygon. Each fragment so produced is fed to the next stage that performs operations on individual fragments before they finally alter the framebuffer. These operations include […]

[OpenGL-3.3 core spec, section 2.4]


would these not run 1280*720 times simply because I change

gl_FragColor = sum/factor;

to

gl_FragColor = texture2D(sTexture, vTextureCoord);?

Division is a costly and complex operation. Since the sum of the kernel is a constant and doesn't change per fragment, you shouldn't evaluate it in the shader. Evaluate it on the CPU and supply 1./factor as a uniform (which is a constant, equal for all fragments), then multiply that with sum, which is much faster than division.
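
For example, a minimal sketch of that idea (uInvFactor is a hypothetical uniform name; the application would set it to 1.0/factor once, before drawing):

    uniform float uInvFactor;   // = 1.0 / factor, computed once on the CPU

    // ... inside main(), after computing sum ...
    gl_FragColor = sum * uInvFactor;   // multiply instead of dividing per fragment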

Your Gaussian kernel is actually a 3×3 matrix, for which there is a dedicated type in GLSL. The calculations you perform can be rewritten in terms of dot products (the mathematically correct term would be scalar or inner product), for which GPUs have dedicated, accelerated instructions.
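
A rough sketch of that idea, reusing the cTemp0 through cTemp8 samples from the question and assuming a hypothetical uniform uKernel that holds the nine weights:

    uniform mat3 uKernel;   // the 3*3 kernel weights, uploaded from the application

    // pack the red channel of the 3*3 neighbourhood into a mat3;
    // the column/row order doesn't matter here as long as uKernel is
    // packed the same way, because all nine products end up summed together
    mat3 r = mat3(cTemp0.r, cTemp1.r, cTemp2.r,
                  cTemp3.r, cTemp4.r, cTemp5.r,
                  cTemp6.r, cTemp7.r, cTemp8.r);

    float sumR = dot(uKernel[0], r[0]) + dot(uKernel[1], r[1]) + dot(uKernel[2], r[2]);
    // repeat for the .g and .b channels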

Also, you shouldn't split up the components of a texture into individual floats.
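
For instance, the three scalar lookups at the top of the shader can be written as a single fetch (a sensible compiler will likely do this for you anyway, but it reads more clearly):

    // one texture fetch instead of three
    vec3 rgb = texture2D(sTexture, vTextureCoord).rgb;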

All in all, you built quite a number of speed bumps into your code.

datenwolf
  • Meh, with multisampling the fragment shader still isn't executed for each and every sample (which is the whole point of multisampling). The fragment being composed of multiple samples is a better explanation, I think. And neither is a fragment necessarily an output pixel. I'm sure this isn't so relevant to the question and you already know all this, but it doesn't match the accuracy known from your other answers. – Christian Rau Sep 17 '13 at 12:53
  • @ChristianRau: I didn't want to overcomplicate things. Yes, a fragment doesn't really correspond to a pixel, nor to a multisampling sample. The specification isn't very clear on what a fragment actually is, other than the result of the fragment shader, which gets processed into (multisampled) bitplane values of the framebuffer. Which, when thinking about parallelization and performance, doesn't really help. If fragment shader performance and parallelization are the primary concern, then associating fragments with pixels or multisample fragments is a good approximation. – datenwolf Sep 17 '13 at 13:54
  • Well, in fact you could have just left out all the stuff regarding antialiasing if you didn't want to overcomplicate things. I agree that a spec explanation of fragments is probably overkill here, but never mind. – Christian Rau Sep 17 '13 at 14:11
  • Thanks for your answer. Yes, the kernel factors could be set as constants, and I know the 3*3 kernel could be done as a GLSL mat3, but I may change it to 5*5 or even 9*9 in the future. I tried to reduce those speed bumps as you said, but the program is still running slowly. Then I tried reducing my kernel to a dummy 1*2 or 1*3 one while still writing gl_FragColor = sum/factor, and it runs pretty fast. I am quite sure the latency comes from doing too much per pixel/fragment, not from the division or the other speed bumps. – flankechen Sep 18 '13 at 05:47
  • Well, the system split my comment, so to continue: I just change the last statement, gl_FragColor = "something", and the program runs at a quite different frame rate. If so, has some of the code not run 1280*720 times? Does gl_FragColor = "something" determine what is relevant to the output so the code can be optimized at run time? – flankechen Sep 18 '13 at 05:55
  • If you can structure your algorithm to use `textureOffset (...)` with constant offsets, that can improve performance too in some cases. It is all about getting the GLSL compiler to understand what you are doing so it can go about choosing the proper instructions and scheduling things as efficiently as possible. In the end, that's how it works with shading languages; you have to be smarter than the compiler if you want to get anywhere by micro-optimizing code in a high-level language. – Andon M. Coleman Sep 18 '13 at 20:09

On a modern (Shader Model 3.0+) GPU, fragment shaders are scheduled to operate on 2x2 blocks of pixels (pixel quads) at a time. Fun fact: this was required in order to implement the derivative instruction in Shader Model 3.0, and it has remained part of GPU architecture design ever since. Pixel quads are the lowest level of granularity you can ever get in fragment shader scheduling. In fact, if you discard in a fragment shader, then unless all of the fragments in the pixel quad also discard, every instance of the fragment shader in the block continues running, and the results are thrown out at the end for the individual fragments that requested the discard.

In addition to this, most GPUs have multiple stream processing units and will schedule pixel quads into larger workgroups (NV calls them warps, AMD calls them wavefronts). In a nutshell, everything happens in parallel; that is the entire premise of GPUs: they run a single task across many threads that all execute in parallel, each on its own piece of the data. This is why they scale so well as cores are added, as opposed to CPUs.

Put simply, rather than dispatching individual instructions in your GLSL shader to run on separate functional units, what really happens is this: your GLSL shader is run on multiple processing units simultaneously (conceptually, one thread per fragment), and these threads all execute the same sequence of instructions in a paradigm known as SIMT (Single Instruction, Multiple Thread).

Getting back to the basic scheduling unit (warp/wavefront): if one instance of your shader stalls fetching memory, the rest of the instances in that scheduling unit also stall, because they all run the same instruction simultaneously. This is why dependent texture reads and large filter kernels are bad mojo; since the texture memory needed by a particular group of fragments may be indeterminate until run time or spread too far apart, efficiently pre-fetching and caching texture data within a scheduling unit can become difficult if not impossible.

The biggest problem with accurately describing the level of parallelism is that GPU architectures keep changing (most of the discussion above relates to Shader Model 3.0+ GPUs). Not too long ago, GPUs had vectorized ISAs, but now both AMD and NV have switched to superscalar because it actually improves instruction scheduling efficiency. Throw specialized embedded GPUs into the mix and you have a real nightmare on your hands; it is hard to say what shader model they really run (since the derivative instruction is optional in OpenGL ES 2.0).


See this other question on Stack Overflow for a more concise statement of what I just wrote.

For some pretty diagrams, here is a somewhat out-of-date but still useful presentation from NVIDIA.

joshperry
Andon M. Coleman
  • Thanks a lot for the explanation; I seem to understand a little bit more now. However, referring to my code example: I just change the last statement, gl_FragColor = "something", and the program runs at a quite different frame rate. If so, has some of the code not run 1280*720 times? Does gl_FragColor = "something" determine what is relevant to the output so the code can be optimized at run time? – flankechen Sep 18 '13 at 08:56
  • 1
    Just a little nitpicking: That 2×2 scheduling is nowhere specified as the "definitive" way to do it. A OpenGL implementation is free to implement it any way it sees fit. Also partial derivatives must be also available between fragments of adjacent blocks; consider a block with every but one fragment discarded; that one remaining fragment still demands a partial derivative. The only thing the GPU can do then is looking at the other blocks. But there's actually some of the 2×2 blocks in which fragments are processed imprinted on OpenGL, namely GLSL. The `textureGather` function. – datenwolf Sep 18 '13 at 22:36
  • Good point, I was speaking generally about current-generation desktop GPU hardware. I suppose I went a little overboard by throwing GLSL requirements out the window and only discussing actual hardware implementations, particularly since this question is related to embedded OpenGL, whose hardware I am not as familiar with. I tend to think of GLSL as a language that has evolved around commodity desktop GPUs, but as you mention, hardware/software implementations of the language need not be constrained by the design of desktop GPUs. Especially with compute shaders blurring the line... – Andon M. Coleman Sep 18 '13 at 23:09