I'm new to GPU computing, so this may be a really naive question.
I did some searching, and computing an integral image on the GPU seems to be a commonly recommended approach.
However, when I dig into it, I wonder whether it is actually faster than the CPU, especially for large images. I'd like to hear your thoughts, along with an explanation if the GPU really is faster.
Assume we have an MxN image. Computing the integral image on the CPU needs roughly 3xMxN additions, which is O(MxN).
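To make the CPU count concrete, here is a minimal sketch of the single-pass recurrence I have in mind, I(x, y) = i(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1). The function name `integral_image` and the operation counter are only for illustration; border pixels are counted as 3 operations too, which is why the total is "roughly" 3xMxN.

```python
def integral_image(img):
    """Return (integral, ops) for an image given as a list of equal-length rows.

    ops counts 3 additions/subtractions per pixel, including border pixels
    where some of the neighbours are implicitly zero.
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    ops = 0
    for y in range(rows):
        for x in range(cols):
            left = out[y][x - 1] if x > 0 else 0
            up = out[y - 1][x] if y > 0 else 0
            diag = out[y - 1][x - 1] if x > 0 and y > 0 else 0
            # pixel + left + up - diagonal: three add/sub operations
            out[y][x] = img[y][x] + left + up - diag
            ops += 3
    return out, ops


if __name__ == "__main__":
    result, ops = integral_image([[1, 2, 3], [4, 5, 6]])
    print(result, ops)
```

Each output pixel depends on pixels computed earlier in the same loop, so this version is inherently sequential.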
On the GPU, following the code from the "OpenGL SuperBible", 6th edition, it needs about KxMxNxlog2(N) + KxMxNxlog2(M) operations, where K is the number of operations per step (a lot of bit shifts, multiplications, additions...).
The GPU can work in parallel on, say, 32 pixels at a time, depending on the device, but that is still O(MxNxlog2(M)).
I think that even at a common resolution like 640x480, the CPU is still faster.
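As a rough tally of the counts above, here is a small sketch that compares total additions only. It assumes (my reading of the book's shader, not something the book states) that each scan performs (n/2) x log2(n) additions on a row padded to the next power of two n, and it ignores both the constant K and any parallelism. All function names are made up for this example.

```python
def next_pow2(n):
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def cpu_additions(height, width):
    """Sequential recurrence: about 3 additions per pixel."""
    return 3 * height * width


def scan_additions(length):
    """Additions for one scan of `length` elements, padded to a power of two.

    Assumes n/2 threads each doing one addition in each of log2(n) steps.
    """
    n = next_pow2(length)
    return (n // 2) * (n.bit_length() - 1)


def gpu_additions(height, width):
    """Two passes: one scan per row, then one scan per column."""
    row_pass = height * scan_additions(width)
    col_pass = width * scan_additions(height)
    return row_pass + col_pass


if __name__ == "__main__":
    print(cpu_additions(480, 640), gpu_additions(480, 640))
```

Under these assumptions, 640x480 gives 921,600 additions for the CPU version and 3,932,160 for the two scan passes, before multiplying by K or dividing by however many pixels the device processes at once.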
Am I wrong here?
[Edit] This is the shader code straight from the book. The idea is to use two passes: compute the prefix sums of the rows, then compute the prefix sums of the columns of the result from pass 1. The shader below performs one pass.

    #version 430 core

    layout (local_size_x = 1024) in;

    shared float shared_data[gl_WorkGroupSize.x * 2];

    layout (binding = 0, r32f) readonly uniform image2D input_image;
    layout (binding = 1, r32f) writeonly uniform image2D output_image;

    void main(void)
    {
        uint id = gl_LocalInvocationID.x;
        uint rd_id;
        uint wr_id;
        uint mask;

        ivec2 P = ivec2(id * 2, gl_WorkGroupID.x);

        const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
        uint step = 0;

        shared_data[id * 2] = imageLoad(input_image, P).r;
        shared_data[id * 2 + 1] = imageLoad(input_image,
                                            P + ivec2(1, 0)).r;

        barrier();
        memoryBarrierShared();

        for (step = 0; step < steps; step++)
        {
            mask = (1 << step) - 1;
            rd_id = ((id >> step) << (step + 1)) + mask;
            wr_id = rd_id + 1 + (id & mask);

            shared_data[wr_id] += shared_data[rd_id];

            barrier();
            memoryBarrierShared();
        }

        imageStore(output_image, P.yx, vec4(shared_data[id * 2]));
        imageStore(output_image, P.yx + ivec2(0, 1),
                   vec4(shared_data[id * 2 + 1]));
    }
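To check my understanding of the loop, I wrote a small CPU-side simulation of the shader's indexing for a single row. This is only a sketch: the function name `scan_like_shader` is mine, the GPU threads are replaced by an inner loop, and the row length is assumed to be a power of two (twice the number of threads, as in the shader). Running the threads one after another is equivalent here because, within a step, the indices being read and the indices being written never overlap.

```python
def scan_like_shader(values):
    """Simulate the shader's scan loop on one row.

    Returns (prefix_sums, adds), where adds is the total number of
    shared_data[wr_id] += shared_data[rd_id] operations performed.
    """
    n = len(values)
    assert n >= 2 and n & (n - 1) == 0, "length must be a power of two"
    threads = n // 2                 # plays the role of gl_WorkGroupSize.x
    steps = n.bit_length() - 1       # log2(n) == log2(threads) + 1
    data = list(values)              # plays the role of shared_data
    adds = 0
    for step in range(steps):
        mask = (1 << step) - 1
        for tid in range(threads):   # each iteration is one invocation
            rd_id = ((tid >> step) << (step + 1)) + mask
            wr_id = rd_id + 1 + (tid & mask)
            data[wr_id] += data[rd_id]
            adds += 1
    return data, adds


if __name__ == "__main__":
    print(scan_like_shader([3, 1, 4, 1, 5, 9, 2, 6]))
```

For 8 inputs this performs 4 x 3 = 12 additions, compared with 7 for a plain sequential running sum, which is where the extra log2 factor in my estimate comes from.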