
I'm new to GPU computing, so this may be a really naive question.
I did a few look-ups, and it seems computing the integral image on the GPU is a pretty good idea.
However, when I really dig into it, I wonder whether it is actually faster than the CPU, especially for big images. So I'd just like to know your thoughts on it, and an explanation if the GPU really is faster.

So, assuming we have an MxN image, CPU computation of the integral image needs roughly 3xMxN additions, which is O(MxN).
On the GPU, following the code provided by the "OpenGL Super Bible", 6th edition, it needs some KxMxNxlog2(N) + KxMxNxlog2(M) operations, in which K is the number of operations for all the bit-shifting, multiplication, addition, and so on.
The GPU can work in parallel on, say, 32 pixels at a time depending on the device, but it's still O(MxNxlog2(M)).
I think even at the common resolution of 640x480, the CPU is still faster.
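For example, at 640x480 my rough count is: the CPU needs about 3x640x480, roughly 0.9 million additions, while the GPU does Kx640x480x(log2(640)+log2(480)), roughly K x 5.6 million operations in total, before dividing by the number of pixels processed in parallel.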

Am I wrong here?
[Edit] This is the shader code straight from the book. The idea is to use two passes: compute the prefix sum of each row, then compute the prefix sum of each column of the result from pass 1. This shader implements one such pass; note that it writes its output transposed (P.yx), so running the same shader twice yields the full 2D integral image.

#version 430 core

// one work group processes one row of the image; each of the 1024
// invocations owns two pixels, so a row of up to 2048 pixels fits
layout (local_size_x = 1024) in;

shared float shared_data[gl_WorkGroupSize.x * 2];

layout (binding = 0, r32f) readonly uniform image2D input_image;
layout (binding = 1, r32f) writeonly uniform image2D output_image;

void main(void)
{
    uint id = gl_LocalInvocationID.x;
    uint rd_id;
    uint wr_id;
    uint mask;
    // P is the first of the two pixels this invocation owns in its row
    ivec2 P = ivec2(id * 2, gl_WorkGroupID.x);
    // log2(1024) + 1 = 11 = log2(2048) steps for the 2048 shared elements
    const uint steps = uint(log2(gl_WorkGroupSize.x)) + 1;
    uint step = 0;

    // each invocation loads its two pixels into shared memory
    shared_data[id * 2] = imageLoad(input_image, P).r;
    shared_data[id * 2 + 1] = imageLoad(input_image, P + ivec2(1, 0)).r;
    barrier();
    memoryBarrierShared();

    // inclusive prefix sum over the whole row in shared memory
    for (step = 0; step < steps; step++)
    {
        mask = (1 << step) - 1;
        rd_id = ((id >> step) << (step + 1)) + mask;
        wr_id = rd_id + 1 + (id & mask);
        shared_data[wr_id] += shared_data[rd_id];
        barrier();
        memoryBarrierShared();
    }

    // write the row's prefix sums transposed (P.yx) so the second pass
    // can run the same shader to sum what were originally the columns
    imageStore(output_image, P.yx, vec4(shared_data[id * 2]));
    imageStore(output_image, P.yx + ivec2(0, 1), vec4(shared_data[id * 2 + 1]));
}
TuTan
  • Try it and find out. – InternetAussie May 11 '17 at 03:22
  • @InternetAussie yes, I'm trying it now. Just that researches on the internet show that GPU is significantly faster than CPU, which is quite surprising to me. – TuTan May 11 '17 at 08:24
  • The description of the parallel algorithm is missing, but the stated bound appear to be pretty bad. It makes me think the theoretical approach is used, where you work on 1x1,1x2,2x2,2x4,4x4 parts. That is to say, you recursively work on bigger scales, but with only small incremental steps. In practical code, you may very well start with 16x16 blocks at a time. And you might even ignore parallelizing the next step, as the 16x16 blocks are already 256 times rarer than the input pixels – MSalters May 11 '17 at 14:31
  • @MSalters I added the code from the book, the idea is calculate integral of the rows, then calculate the integral of the column of the result. – TuTan May 12 '17 at 03:25
  • @MSalters I think I'll try your suggestion of 16x16 block, or maybe calculate some other numbers that fit my situation. Thanks – TuTan May 12 '17 at 03:33

1 Answer


What do you mean by integral image?

My assumption is summing K images of the same resolution MxN together. In that case it is O(K.M.N) on both CPU and GPU, but the constant factor can be better on the GPU, as gfx memory access is much faster than on the CPU side. There are also usually more GPU cores than CPU cores, favoring the GPU for this.

If K is too big to fit into the GPU texture units U at once, then you need to use multiple passes, so O(K.M.N.log(K)/log(U)) for K>U ... where the CPU might be faster in some cases. But as the previous comment suggested, without a test you can only guess. You also need to take into account that there are things like bindless texturing and texture arrays which allow doing this in a single pass (but I am unsure if there are any performance costs for that).

[Edit1] after clarifying what you actually want to do

First, let's assume for simplicity that we have a square input image of NxN pixels. We can then handle the task as H-lines and V-lines separately (similar to a 2D FFT) to ease up the process. On top of that, we can subdivide each line into groups of M pixels. So:

N = M.K

Where N is the resolution, M is the region resolution, and K is the number of regions.

  1. 1st. pass

    Render a line for each group, so we get K lines of size M, using a fragment shader that computes the integral image of each region only, outputting to some texture. This is T(0.5*K*M^2*N). This whole thing can be done in a fragment shader rendered by a single QUAD covering the screen (see the sketch after this list) ...

  2. 2nd. pass

    Convert the region integrals to full-image integrals. So again render the K lines, and in the fragment add all the last pixels of each previous group. This is T(0.5*K^3*N). This whole thing can also be done in a fragment shader rendered by a single QUAD covering the screen ...

  3. do #1,#2 on the result in the other axis direction
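As a concrete illustration, here is a minimal GLSL fragment-shader sketch of passes #1 and #2 for the horizontal direction (the names src, region_size and pass_no are mine, just for illustration; a real implementation would likely use two specialized shaders instead of a branch):

#version 430 core
// hypothetical sketch of passes #1 and #2 for the horizontal axis;
// src is the input of the current pass: the original image for pass #1,
// the pass #1 result for pass #2
layout (binding = 0) uniform sampler2D src;
uniform int region_size;   // M
uniform int pass_no;       // 1 or 2
out vec4 frag_color;

void main(void)
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    int region = p.x / region_size;    // which M-pixel group we are in
    float sum = 0.0;
    if (pass_no == 1)
    {
        // pass #1: prefix sum inside this pixel's own M-pixel region
        for (int x = region * region_size; x <= p.x; x++)
            sum += texelFetch(src, ivec2(x, p.y), 0).r;
    }
    else
    {
        // pass #2: region-local prefix plus the totals (last pixels)
        // of all previous regions
        sum = texelFetch(src, p, 0).r;
        for (int r = 0; r < region; r++)
            sum += texelFetch(src, ivec2((r + 1) * region_size - 1, p.y), 0).r;
    }
    frag_color = vec4(sum);
}

Render a single QUAD covering the screen with pass_no = 1, then again with pass_no = 2 on the result, and then repeat both passes with the axes swapped for the vertical direction.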

This whole thing converts to

T(2*N*(0.5*K*M^2+0.5*K^3))
T(N*(K*M^2+K^3))
O(N*(K*M^2+K^3))

Now you can tweak M for maximum performance on your setup ... If I rewrite the whole thing in terms of M and N, then:

T(N*((N/M)*M^2+(N/M)^3))
T(N*(N*M+(N/M)^3))

So you should minimize this term, hence I would try to use values around

N*M = (N/M)^3
N*M = N^3/M^3
M^4 = N^2
M^2 = N
M = sqrt(N) = N^0.5
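
For example, assuming N = 1024:

M = sqrt(1024) = 32
K = N/M = 32

so regions of about 32 pixels should be near the optimum for a 1024x1024 image.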

So the whole thing converts to:

T(N*(N*M+(N/M)^3))
T(N*(N*N^0.5+(N/N^0.5)^3))
T(N^2.5+N^1.5)
O(N^2.5)

Which is faster than the naive O(N^4). But you're right that the CPU has fewer operations to do, O(N^2), for this, and does not require copying the data or multiple passes, so you should find out the threshold resolution on the specific HW for your task and choose depending on the measurements. PS. I hope I did not make a silly mistake somewhere in the computations. Also, if you do the H and V lines separately on the CPU, then the CPU-side complexity will be O(N^3), and using this approach even O(N^2.5), without the need for two passes per axis.

Take a look at this similar QA; I think it is a good starting point.

Spektre
  • What makes a GPU win out over a CPU here is its great degree of parallelism. There's only so much you can do in parallel on a CPU (~2 threads per core and, depending on architecture, up to ~4 instructions per clock cycle), and that's for ridiculously dense and optimized code. However, there are certain problems that are hard to parallelize, where splitting up the problem and re-joining the results adds significant overhead (as this Q&A shows). In general, if the split/join complexity is logarithmic you're mostly good. – datenwolf May 11 '17 at 08:13
  • 2
    Well, in short, integral image has a pixel 'P[a,b] = sum(A[i,j])' with 'i < a, j – TuTan May 11 '17 at 08:14
  • @Spektre thanks a lot for your detailed answer, I'm trying this out now to see which one works better for me – TuTan May 11 '17 at 10:05