4

We are working on a workstation with a Core i7 CPU and an AMD FirePro 8000 GPU. For video decoding (8K, 7680x4320 video frames, ~66 MB each, HAP Q codec) we tried to use the following obvious loop:

  1. get frame from stream
  2. map buffer
  3. decode frame slices multi-threaded into mapped buffer
  4. unmap buffer
  5. glTexSubImage into the texture from the bound PBO

BUT step 3, decoding the slices multi-threaded into the mapped buffer, is horribly slow: it takes at least some 40 ms to finish.

When we split this into two steps:

3a. decode frame slices multi-threaded into malloced memory

3b. memcpy from malloced memory into mapped buffer

both steps together take 8 + 9 ~ 17 ms to finish. Now we have a somewhat acceptable solution, but the extra copy step is painful.

Why is multithreaded unpacking into mapped memory so exceptionally slow? How can we avoid the extra copy step?

Edit 1:

This is how the buffer is generated, defined and mapped:

glGenBuffers(1, &hdf.m_pbo_id);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, hdf.m_pbo_id);
glBufferData(GL_PIXEL_UNPACK_BUFFER, m_compsize, nullptr, GL_STREAM_DRAW);
hdf.mapped_buffer = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
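
One alternative raised in the comments below is a persistently mapped buffer. A sketch of what that could look like, assuming an OpenGL 4.4 context (or ARB_buffer_storage); the storage/map flags are assumptions, the variable names reuse the question's code, and the driver may still hand back write-combined memory:

```cpp
// Sketch only: persistently mapped PBO (needs GL 4.4 / ARB_buffer_storage).
glGenBuffers(1, &hdf.m_pbo_id);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, hdf.m_pbo_id);
// Immutable storage, mapped once for the lifetime of the buffer.
glBufferStorage(GL_PIXEL_UNPACK_BUFFER, m_compsize, nullptr,
                GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
hdf.mapped_buffer = (GLubyte*)glMapBufferRange(
    GL_PIXEL_UNPACK_BUFFER, 0, m_compsize,
    GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT);
// The pointer stays valid across frames, so the per-frame map/unmap cost
// disappears; synchronize with glFenceSync/glClientWaitSync before reusing
// a region the GPU may still be reading from.
```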

Edit 2:

A question was raised about how the time is measured. Only the non-GL code is measured. The pseudo code is like this:

Case 1 (very slow, t2-t1 ~ 40ms):

gl_map();
t1 = elapse_time();
unpack_multithreaded_multiple_snappy_slices_into_mapped_buffer();
t2 = elapse_time();
gl_unmap();

Case 2 (medium slow, t2-t1 ~ 8ms, t3-t2 ~ 9ms):

gl_map();
malloc_sys_buffer();
t1 = elapse_time();
unpack_multithreaded_multiple_snappy_slices_into_sys_buffer();
t2 = elapse_time();
memcpy_sys_buffer_into_mapped_buffer();
t3 = elapse_time();
gl_unmap();

Inside the measured code blocks there is no OpenGL code involved. Maybe it is a write-through / CPU-cache issue.

  • Is the buffer persistently mapped or not? Is the buffer allocated with immutable storage? Which flags are set? – BDL Aug 29 '18 at 08:32
  • I added the buffer gen/define/map code above, please have a look. Is this what you were asking for? – Heiner Aug 29 '18 at 12:01
  • 1
    How do you measure the timings? What exactly is included in the measurements. The problem is that due to synchronization issues there can be a huge difference depending on how exactly you wrote the code. Example: `map - long calculation - unmap - draw - repeat`. In this case map can take very long because it waits until draw has finished. But that doesn't mean that the long calculation is slower. It just starts later. Without seeing a [MCVE] I don't think it is possible to answer the question. – BDL Aug 31 '18 at 11:55
  • Giving a complete compilable example is impossible. But regarding measuring the elapsed time, it is exactly as stated above. I added pseudo code for maximal clarity. – Heiner Sep 02 '18 at 11:44
  • "Maybe it is a write-through / cpu-cache issue." Yes, this would be a possible explanation. You should have a look at the actual mapping to see what is going on in your particular implementation. On Windows, some tools from the Sysinternals suite (I forget which one exactly) can show you the mapping information. You could try switching to a persistently mapped buffer; it _might_ lead to a different mapping (and it also avoids the map/unmap overhead). – derhass Sep 02 '18 at 13:51

1 Answer

1

Unpacking into mapped memory is slow because this memory is write-combined. Write-combined memory bypasses the CPU caches: writes are collected in a handful of small write-combining buffers and sent over the bus as whole transfers. When many threads write small, scattered pieces, those buffers are flushed partially and repeatedly, which makes every transfer inefficient. The best way to interact with this memory is to write data sequentially, in chunks as big as possible. To avoid the extra copy step you may need to modify your decoder so that it writes large contiguous blocks. It is also worth experimenting with the number of writing threads. There is a great overview of this here: https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/