Read pixel data from default framebuffer in OpenGL: Performance of FBO vs. PBO

Question

My goal is to read the contents of the default OpenGL framebuffer and store the pixel data in a cv::Mat. Apparently there are two different ways of achieving this:

1) Synchronous: use FBO and glReadPixels

cv::Mat a = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, a.data);

2) Asynchronous: use PBO and glReadPixels

cv::Mat b = cv::Mat::zeros(cv::Size(1920, 1080), CV_8UC3);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo_userImage);
    glReadPixels(0, 0, 1920, 1080, GL_BGR, GL_UNSIGNED_BYTE, 0);
    unsigned char* ptr = static_cast<unsigned char*>(glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY));
    std::copy(ptr, ptr + 1920 * 1080 * 3 * sizeof(unsigned char), b.data);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

From all the information I collected on this topic, the asynchronous version 2) should be much faster. However, comparing the elapsed time for both versions shows that the differences are often minimal, and sometimes version 1) even outperforms the PBO variant.

For performance checks, I've inserted the following code (based on this answer):

std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
....
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << std::endl;

I've also experimented with the usage hint when creating the PBO: I didn't find much of a difference between GL_DYNAMIC_COPY and GL_STREAM_READ here.

I'd be happy for suggestions on how to increase the speed of this pixel read operation from the framebuffer even further.

Your second version is not really asynchronous, since you're blocking for the result to arrive immediately after making the `glReadPixels()` call. — Reto Koradi, Apr 25 '16 at 14:35
You mean the call to `std::copy`? Actually if I comment out this line, the effect is minimal and still version 1 is sometimes faster. — Schnigges, Apr 25 '16 at 14:44
There is a rather big difference if I don't map the GPU buffer to CPU memory, which is of course expected, but the mapping is necessary afterwards as I want to store the `cv::Mat` in a vector — Schnigges, Apr 25 '16 at 14:46

score 7 · Accepted Answer · answered Apr 25 '16 at 15:09

Your second version is not asynchronous at all, since you're mapping the buffer immediately after triggering the copy. The map call will then block until the contents of the buffer are available, effectively becoming synchronous.

Or, depending on the driver, it will block only when you actually read from the mapped memory. In other words, the driver may implement the mapping in such a way that the first access causes a page fault and a subsequent synchronization. It doesn't really matter in your case, since you access the data straight away via std::copy.

The proper way of doing this is by using sync objects and fences.

Keep your PBO setup, but after issuing the glReadPixels into a PBO, insert a sync object into the stream via glFenceSync. Then, some time later, poll for that fence sync object to be complete (or just wait for it altogether) via glClientWaitSync.

If glClientWaitSync reports that the commands before the fence are complete, you can read from the buffer without an expensive CPU/GPU sync. (If the driver is particularly stupid and didn't already move the buffer contents into mappable addresses, in spite of your usage hints on the PBO, you can use another thread to perform the map. glGetBufferSubData can therefore be cheaper, as the data doesn't need to be in a mappable range.)
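
The issue/poll/read flow with a fence can be sketched roughly as follows. This is only an illustration under some assumptions: to keep it compilable without a GL context, the GL calls are hidden behind hypothetical hooks (`ReadbackHooks` and `AsyncReadback` are made-up names, not part of OpenGL), and the comments name the real calls each hook is assumed to wrap.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical hooks standing in for the GL calls, so the control flow
// compiles and can be exercised without a GL context.
struct ReadbackHooks {
    // Assumed to wrap:
    //   glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    //   glReadPixels(0, 0, w, h, GL_BGR, GL_UNSIGNED_BYTE, 0);
    //   fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    std::function<void()> issue;

    // Assumed to wrap a zero-timeout poll:
    //   GLenum r = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 0);
    //   return r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED;
    std::function<bool()> fence_signaled;

    // Assumed to wrap glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY),
    // a copy into dst, glUnmapBuffer, and finally glDeleteSync(fence).
    std::function<void(std::vector<unsigned char>&)> copy_out;
};

class AsyncReadback {
public:
    explicit AsyncReadback(ReadbackHooks hooks) : hooks_(std::move(hooks)) {}

    // Queue the readback; returns immediately without touching the pixels.
    void start() {
        hooks_.issue();
        pending_ = true;
    }

    // Non-blocking poll. Copies the pixels into dst and returns true only
    // once the fence has signaled; otherwise returns false right away.
    bool try_finish(std::vector<unsigned char>& dst) {
        if (!pending_ || !hooks_.fence_signaled()) return false;
        hooks_.copy_out(dst);
        pending_ = false;
        return true;
    }

    bool pending() const { return pending_; }

private:
    ReadbackHooks hooks_;
    bool pending_ = false;
};
```

In a render loop you would call `start()` after drawing, do other CPU work, and call `try_finish()` on a later iteration. Calling `try_finish()` in a tight loop right after `start()` would bring back the synchronous behaviour.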

If you need to do this on a frame-by-frame basis, you'll notice that it's very likely that you'll need more than one PBO, that is, have a small pool of them. This is because at the next frame the readback of the previous frame's data is not complete yet and the corresponding fence not signalled. (Yes, GPUs are massively pipelined these days, and they will be some frames behind your submission queue).
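
The bookkeeping for such a pool might look like the sketch below. It is a minimal ring of N slots, assuming each slot owns one PBO and one fence. `SlotHooks` and `PboRing` are hypothetical names, and the hooks again stand in for the real GL calls named in the comments, so that the logic compiles without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical per-slot hooks; slot i is assumed to own PBO i and fence i.
struct SlotHooks {
    // Assumed: bind PBO[slot], glReadPixels(..., 0), fence[slot] = glFenceSync(...).
    std::function<void(std::size_t)> issue;
    // Assumed: zero-timeout glClientWaitSync on fence[slot].
    std::function<bool(std::size_t)> fence_signaled;
    // Assumed: map PBO[slot], copy the pixels out, unmap, glDeleteSync(fence[slot]).
    std::function<void(std::size_t)> consume;
};

template <std::size_t N>
class PboRing {
public:
    explicit PboRing(SlotHooks hooks) : hooks_(std::move(hooks)) {}

    // Call once per frame after rendering. Returns false if all N slots are
    // still in flight, in which case this frame's readback is skipped.
    bool submit() {
        if (in_flight_ == N) return false;
        hooks_.issue(head_);
        head_ = (head_ + 1) % N;
        ++in_flight_;
        return true;
    }

    // Consume every readback whose fence has signaled, oldest first.
    // Stops at the first slot that is not ready; returns the number consumed.
    std::size_t drain() {
        std::size_t done = 0;
        while (in_flight_ > 0 && hooks_.fence_signaled(tail_)) {
            hooks_.consume(tail_);
            tail_ = (tail_ + 1) % N;
            --in_flight_;
            ++done;
        }
        return done;
    }

    std::size_t in_flight() const { return in_flight_; }

private:
    SlotHooks hooks_;
    std::size_t head_ = 0;      // next slot to issue into
    std::size_t tail_ = 0;      // oldest slot still in flight
    std::size_t in_flight_ = 0; // number of issued but unconsumed slots
};
```

Per frame you would call `drain()` and then `submit()`. If `submit()` returns false often, the pool is probably too small for the GPU's latency, and you can either enlarge N or accept dropped frames.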

I'd add that doing the transfer asynchronously only gives a benefit if you're doing something between triggering the operation (`glReadPixels` with a PBO) and reading the result (`glMapBuffer`, `glGetBufferSubData`, etc). Creating a sync object and immediately waiting on it is no better than just `glReadPixels` directly. — Colonel Thirty Two, Apr 25 '16 at 21:16
