I am enrolled in shaders course and interested in computer vision and image processing. I was wondering how can I mix GLSL shaders knowledge with image processing? What do I gain if I implement image processing algorithms with GLSL?
3 Answers
Case study: real time box blur on CPU vs GPU fragment shader
I have implemented a simple box blur https://en.wikipedia.org/wiki/Box_blur algorithm on CPU and GPU fragment shader to see which was faster:
- demo video https://www.youtube.com/watch?v=MRhAljmHq-o
- source code: https://github.com/cirosantilli/cpp-cheat/blob/cfd18700827dfcff6080a938793b44b790f0b7e7/opengl/glfw_webcam_image_process.c
My camera refresh rate capped FPS to 30, so I measured how wide the box could be and still keep 30 FPS.
On a Lenovo T430 (2012), NVIDIA NVS5400, Ubuntu 16.04 with image dimensions 960x540, the maximum widths were:
- GPU: 23
- CPU: 5
Since the computation is quadratic, the speedup was:
( 23 / 5 ) ^ 2 = 21.16
faster on GPU than CPU!
Not all algorithms are faster on the GPU. For example, operations that act on single pictures like swapping RGB reach 30FPS on the CPU, so it is useless to add the complexity of GPU programming to it.
Like any other CPU vs GPU speedup question, it all comes down if you have enough work per byte transferred to the GPU, and benchmarking is the best thing you can do. In general, quadratic algorithms or worse are a good bet for the GPU. See also: What do the terms "CPU bound" and "I/O bound" mean?
Main part of the code (just clone from GitHub):
#include "common.h"
#include "../v4l2/common_v4l2.h"
static const GLuint WIDTH = 640;
static const GLuint HEIGHT = 480;
static const GLfloat vertices[] = {
/* xy uv */
-1.0, 1.0, 0.0, 1.0,
0.0, 1.0, 0.0, 0.0,
0.0, -1.0, 1.0, 0.0,
-1.0, -1.0, 1.0, 1.0,
};
static const GLuint indices[] = {
0, 1, 2,
0, 2, 3,
};
static const GLchar *vertex_shader_source =
"#version 330 core\n"
"in vec2 coord2d;\n"
"in vec2 vertexUv;\n"
"out vec2 fragmentUv;\n"
"void main() {\n"
" gl_Position = vec4(coord2d, 0, 1);\n"
" fragmentUv = vertexUv;\n"
"}\n";
static const GLchar *fragment_shader_source =
"#version 330 core\n"
"in vec2 fragmentUv;\n"
"out vec3 color;\n"
"uniform sampler2D myTextureSampler;\n"
"void main() {\n"
" color = texture(myTextureSampler, fragmentUv.yx).rgb;\n"
"}\n";
static const GLchar *vertex_shader_source2 =
"#version 330 core\n"
"in vec2 coord2d;\n"
"in vec2 vertexUv;\n"
"out vec2 fragmentUv;\n"
"void main() {\n"
" gl_Position = vec4(coord2d + vec2(1.0, 0.0), 0, 1);\n"
" fragmentUv = vertexUv;\n"
"}\n";
static const GLchar *fragment_shader_source2 =
"#version 330 core\n"
"in vec2 fragmentUv;\n"
"out vec3 color;\n"
"uniform sampler2D myTextureSampler;\n"
"// pixel Delta. How large a pixel is in 0.0 to 1.0 that textures use.\n"
"uniform vec2 pixD;\n"
"void main() {\n"
/*"// Identity\n"*/
/*" color = texture(myTextureSampler, fragmentUv.yx ).rgb;\n"*/
/*"// Inverter\n"*/
/*" color = 1.0 - texture(myTextureSampler, fragmentUv.yx ).rgb;\n"*/
/*"// Swapper\n"*/
/*" color = texture(myTextureSampler, fragmentUv.yx ).gbr;\n"*/
/*"// Double vision ortho.\n"*/
/*" color = ("*/
/*" texture(myTextureSampler, fragmentUv.yx ).rgb +\n"*/
/*" texture(myTextureSampler, fragmentUv.xy ).rgb\n"*/
/*" ) / 2.0;\n"*/
/*"// Multi-me.\n"*/
/*" color = texture(myTextureSampler, 4.0 * fragmentUv.yx ).rgb;\n"*/
/*"// Horizontal linear blur.\n"*/
/*" int blur_width = 21;\n"*/
/*" int blur_width_half = blur_width / 2;\n"*/
/*" color = vec3(0.0, 0.0, 0.0);\n"*/
/*" for (int i = -blur_width_half; i <= blur_width_half; ++i) {\n"*/
/*" color += texture(myTextureSampler, vec2(fragmentUv.y + i * pixD.x, fragmentUv.x)).rgb;\n"*/
/*" }\n"*/
/*" color /= blur_width;\n"*/
/*"// Square linear blur.\n"*/
" int blur_width = 23;\n"
" int blur_width_half = blur_width / 2;\n"
" color = vec3(0.0, 0.0, 0.0);\n"
" for (int i = -blur_width_half; i <= blur_width_half; ++i) {\n"
" for (int j = -blur_width_half; j <= blur_width_half; ++j) {\n"
" color += texture(\n"
" myTextureSampler, fragmentUv.yx + ivec2(i, j) * pixD\n"
" ).rgb;\n"
" }\n"
" }\n"
" color /= (blur_width * blur_width);\n"
"}\n";
int main(int argc, char **argv) {
CommonV4l2 common_v4l2;
GLFWwindow *window;
GLint
coord2d_location,
myTextureSampler_location,
vertexUv_location,
coord2d_location2,
pixD_location2,
myTextureSampler_location2,
vertexUv_location2
;
GLuint
ebo,
program,
program2,
texture,
vbo,
vao,
vao2
;
unsigned int
cpu,
width,
height
;
uint8_t *image;
float *image2 = NULL;
/*uint8_t *image2 = NULL;*/
if (argc > 1) {
width = strtol(argv[1], NULL, 10);
} else {
width = WIDTH;
}
if (argc > 2) {
height = strtol(argv[2], NULL, 10);
} else {
height = HEIGHT;
}
if (argc > 3) {
cpu = (argv[3][0] == '1');
} else {
cpu = 0;
}
/* Window system. */
glfwInit();
glfwWindowHint(GLFW_RESIZABLE, GL_FALSE);
window = glfwCreateWindow(2 * width, height, __FILE__, NULL, NULL);
glfwMakeContextCurrent(window);
glewInit();
CommonV4l2_init(&common_v4l2, COMMON_V4L2_DEVICE, width, height);
/* Shader setup. */
program = common_get_shader_program(vertex_shader_source, fragment_shader_source);
coord2d_location = glGetAttribLocation(program, "coord2d");
vertexUv_location = glGetAttribLocation(program, "vertexUv");
myTextureSampler_location = glGetUniformLocation(program, "myTextureSampler");
/* Shader setup 2. */
const GLchar *fs;
if (cpu) {
fs = fragment_shader_source;
} else {
fs = fragment_shader_source2;
}
program2 = common_get_shader_program(vertex_shader_source2, fs);
coord2d_location2 = glGetAttribLocation(program2, "coord2d");
vertexUv_location2 = glGetAttribLocation(program2, "vertexUv");
myTextureSampler_location2 = glGetUniformLocation(program2, "myTextureSampler");
pixD_location2 = glGetUniformLocation(program2, "pixD");
/* Create vbo. */
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(vertices), vertices, GL_STATIC_DRAW);
glBindBuffer(GL_ARRAY_BUFFER, 0);
/* Create ebo. */
glGenBuffers(1, &ebo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(indices), indices, GL_STATIC_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
/* vao. */
glGenVertexArrays(1, &vao);
glBindVertexArray(vao);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(coord2d_location, 2, GL_FLOAT, GL_FALSE, 4 * sizeof(vertices[0]), (GLvoid*)0);
glEnableVertexAttribArray(coord2d_location);
glVertexAttribPointer(vertexUv_location, 2, GL_FLOAT, GL_FALSE, 4 * sizeof(GLfloat), (GLvoid*)(2 * sizeof(vertices[0])));
glEnableVertexAttribArray(vertexUv_location);
glBindVertexArray(0);
/* vao2. */
glGenVertexArrays(1, &vao2);
glBindVertexArray(vao2);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ebo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexAttribPointer(coord2d_location2, 2, GL_FLOAT, GL_FALSE, 4 * sizeof(vertices[0]), (GLvoid*)0);
glEnableVertexAttribArray(coord2d_location2);
glVertexAttribPointer(vertexUv_location2, 2, GL_FLOAT, GL_FALSE, 4 * sizeof(GLfloat), (GLvoid*)(2 * sizeof(vertices[0])));
glEnableVertexAttribArray(vertexUv_location2);
glBindVertexArray(0);
/* Texture buffer. */
glGenTextures(1, &texture);
glBindTexture(GL_TEXTURE_2D, texture);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
/* Constant state. */
glViewport(0, 0, 2 * width, height);
glClearColor(1.0f, 1.0f, 1.0f, 1.0f);
glActiveTexture(GL_TEXTURE0);
/* Main loop. */
common_fps_init();
do {
/* Blocks until an image is available, thus capping FPS to that.
* 30FPS is common in cheap webcams. */
CommonV4l2_updateImage(&common_v4l2);
image = CommonV4l2_getImage(&common_v4l2);
glClear(GL_COLOR_BUFFER_BIT);
/* Original. */
glTexImage2D(
GL_TEXTURE_2D, 0, GL_RGB, width, height,
0, GL_RGB, GL_UNSIGNED_BYTE, image
);
glUseProgram(program);
glUniform1i(myTextureSampler_location, 0);
glBindVertexArray(vao);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);
/* Optional CPU modification to compare with GPU shader speed. */
if (cpu) {
image2 = realloc(image2, 3 * width * height * sizeof(image2[0]));
for (unsigned int i = 0; i < height; ++i) {
for (unsigned int j = 0; j < width; ++j) {
size_t index = 3 * (i * width + j);
/* Inverter. */
/*image2[index + 0] = 1.0 - (image[index + 0] / 255.0);*/
/*image2[index + 1] = 1.0 - (image[index + 1] / 255.0);*/
/*image2[index + 2] = 1.0 - (image[index + 2] / 255.0);*/
/* Swapper. */
/*image2[index + 0] = image[index + 1] / 255.0;*/
/*image2[index + 1] = image[index + 2] / 255.0;*/
/*image2[index + 2] = image[index + 0] / 255.0;*/
/* Square linear blur. */
int blur_width = 5;
int blur_width_half = blur_width / 2;
int blur_width2 = (blur_width * blur_width);
image2[index + 0] = 0.0;
image2[index + 1] = 0.0;
image2[index + 2] = 0.0;
for (int k = -blur_width_half; k <= blur_width_half; ++k) {
for (int l = -blur_width_half; l <= blur_width_half; ++l) {
int i2 = i + k;
int j2 = j + l;
// Out of bounds is black. TODO: do module to match shader exactly.
if (i2 > 0 && i2 < (int)height && j2 > 0 && j2 < (int)width) {
unsigned int srcIndex = index + 3 * (k * width + l);
image2[index + 0] += image[srcIndex + 0];
image2[index + 1] += image[srcIndex + 1];
image2[index + 2] += image[srcIndex + 2];
}
}
}
image2[index + 0] /= (blur_width2 * 255.0);
image2[index + 1] /= (blur_width2 * 255.0);
image2[index + 2] /= (blur_width2 * 255.0);
}
}
glTexImage2D(
GL_TEXTURE_2D, 0, GL_RGB, width, height,
0, GL_RGB, GL_FLOAT, image2
);
}
/* Modified. */
glUseProgram(program2);
glUniform1i(myTextureSampler_location2, 0);
glUniform2f(pixD_location2, 1.0 / width, 1.0 / height);
glBindVertexArray(vao2);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);
glfwSwapBuffers(window);
glfwPollEvents();
common_fps_print();
} while (!glfwWindowShouldClose(window));
/* Cleanup. */
if (cpu) {
free(image2);
}
CommonV4l2_deinit(&common_v4l2);
glDeleteBuffers(1, &vbo);
glDeleteVertexArrays(1, &vao);
glDeleteTextures(1, &texture);
glDeleteProgram(program);
glfwTerminate();
return EXIT_SUCCESS;
}

- 347,512
- 102
- 1,199
- 985
The first obvious answer is that you gain parallelism. Now, why using GLSL rather than, say CUDA which is more flexible ? GLSL doesn't require you to have an NVIDIA graphics card, so it's a much more portable solution (you'd still have the option of OpenCL though).
What can you gain with parallelism ? Most of the time, you can treat pixels independantly. For instance, increasing the contrast of an image usually requires you to loop over all pixels and apply an affine transform of the pixel values. If each pixel is handled by a separate thread, then you don't need to do this loop anymore : you just raterize a quad, and apply a pixel shader that reads a texture at the current rasterized point, and ouput to the render target (or the screen) the transformed pixel value.
The drawback is that your data need to reside on the GPU : you'll need to transfer all your images to the GPU which can take some time, and can make the speedup gained with the parallelization useless. As such, GPU implementations are often done either when the operations to be made are compute intensive, or when the whole pipeline can remain on the GPU (for instance, if the goal is to only display the modified image on screen, you save the need to transfer back the image on the CPU).

- 3,286
- 4
- 29
- 39
-
1Why no mention of OpenCL if you're talking about CUDA? – Nicol Bolas Dec 04 '12 at 01:57
-
Based on benchmarks depicted in "Game Engine Gems 2" GLSL outperforms, in some cases , both OpenCL and CUDA :) – Michael IV Dec 06 '12 at 11:56
-
Nicol: It seems to me that it is mentionned at the third line of my answer, isn't it ? (I have not edited my answer) – nbonneel Jan 04 '13 at 23:13
OpenGL 4.3 (announced at SIGGRAPH 2012) supports Compute shaders. If you are doing strictly graphics work, and already using OpenGL, it might be easier to use this than OpenCL / OpenGL interop (or CUDA / OpenGL interop).
Here is what Khronos has to say about when to use 4.3 Compute shaders versus OpenCL: Link to PDF; see slide 5.

- 6,223
- 1
- 12
- 20