We are working on a workstation with a Core i7 and an AMD FirePro 8000. To decode 8K video (7680x4320, ~66MB per frame, Hap Q codec) we tried the following obvious loop:
1. get frame from stream
2. map buffer
3. decode frame slices multi-threaded into mapped buffer
4. unmap buffer
5. texsubimage into texture from bound PBO
But step 3 (decoding the slices multi-threaded into the mapped buffer) is horribly slow: it takes at least 40ms to finish.
When we split this into two steps
3a. decode frame slices multi-threaded into malloced memory
3b. memcpy from malloced memory into mapped buffer
the two steps together take 8+9 ~ 17ms to finish. This gives us a somewhat acceptable solution, but the extra copy step is painful.
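For reference, a minimal sketch of the two-step variant (3a + 3b), not our production code. `decode_slice` is a hypothetical stand-in for the real Snappy slice decoder, `decode_via_staging` is an illustrative name, and `dst` is assumed to be the pointer returned by glMapBuffer (any writable buffer of `frame_size` bytes works for testing). One thread per slice is used only to keep the sketch short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real slice decoder: fills the slice's
// output range with a byte derived from the slice index.
static void decode_slice(std::size_t slice_index, std::uint8_t* out, std::size_t len) {
    std::memset(out, static_cast<int>(slice_index & 0xFF), len);
}

// Step 3a: decode all slices in parallel into ordinary system memory.
// Step 3b: copy the finished frame into dst with one sequential memcpy.
// In the real loop dst would be the mapped PBO pointer (assumption).
void decode_via_staging(std::uint8_t* dst, std::size_t frame_size, std::size_t slice_count) {
    assert(dst != nullptr && frame_size > 0 && slice_count > 0);
    std::vector<std::uint8_t> staging(frame_size);
    // Round up so that slice_count slices cover the whole frame.
    const std::size_t slice_len = (frame_size + slice_count - 1) / slice_count;

    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < slice_count; ++i) {
        const std::size_t begin = std::min(i * slice_len, frame_size);
        const std::size_t len = std::min(slice_len, frame_size - begin);
        workers.emplace_back(decode_slice, i, staging.data() + begin, len);
    }
    for (auto& w : workers) w.join();

    // The only write to dst: a single linear pass over the buffer.
    std::memcpy(dst, staging.data(), frame_size);
}
```

The point of the sketch is that the worker threads only ever touch `staging`; the mapped buffer is written once, front to back, by a single thread.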
Why is multithreaded unpacking into mapped memory so exceptionally slow? How can we avoid the extra copy step?
Edit 1:
This is how the buffer is generated, defined and mapped:
glGenBuffers(1, &hdf.m_pbo_id);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, hdf.m_pbo_id);
glBufferData(GL_PIXEL_UNPACK_BUFFER, m_compsize, nullptr, GL_STREAM_DRAW);
hdf.mapped_buffer = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
Edit 2:
A question was raised about how the time is measured. Only the non-GL code is measured. The pseudocode looks like this:
Case 1 (very slow, t2-t1 ~ 40ms):
gl_map();
t1 = elapse_time();
unpack_multithreaded_multiple_snappy_slices_into_mapped_buffer();
t2 = elapse_time();
gl_unmap();
Case 2 (medium slow, t2-t1 ~ 8ms, t3-t2 ~ 9ms):
gl_map();
malloc_sys_buffer();
t1 = elapse_time();
unpack_multithreaded_multiple_snappy_slices_into_sys_buffer();
t2 = elapse_time();
memcpy_sys_buffer_into_mapped_buffer();
t3 = elapse_time();
gl_unmap();
There is no OpenGL code inside the measured blocks. Maybe it is a write-through / CPU cache issue.
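To make the measurement concrete, here is a rough sketch of how such timings can be taken with std::chrono. `time_ms`, `CopyTimings` and `time_fill_and_copy` are hypothetical names, not our actual `elapse_time()` code; the memset only stands in for the slice unpacking, and `dst` is assumed to be either the mapped pointer or a plain buffer.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: runs work once and returns the wall-clock
// duration in milliseconds, measured with a monotonic clock.
template <typename Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

struct CopyTimings {
    double fill_ms;  // corresponds to t2-t1 in Case 2
    double copy_ms;  // corresponds to t3-t2 in Case 2
};

// Times filling a system buffer and the following memcpy separately.
// The memset is a placeholder for the real multi-threaded unpack.
CopyTimings time_fill_and_copy(std::uint8_t* dst, std::size_t size) {
    std::vector<std::uint8_t> sys_buffer(size);
    CopyTimings t{};
    t.fill_ms = time_ms([&] { std::memset(sys_buffer.data(), 0xAB, size); });
    t.copy_ms = time_ms([&] { std::memcpy(dst, sys_buffer.data(), size); });
    return t;
}
```

Calling `time_fill_and_copy` once with the mapped pointer and once with a malloced buffer of the same size should show whether the destination memory alone accounts for the difference.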