
If I plan to use multithreading in OpenGL, should I have separate buffers (from glGenBuffers) for each context?

I do not know much about OpenGL multithreading yet (for now I work in a single thread). I need to know whether I can share buffers that are already uploaded to video memory (with glBufferData/glBufferSubData), or whether I have to keep a copy of each buffer for the other thread.

Cœur
tower120
  • 1
    First of all, what are you trying to do with multi-threading? If you share resources between contexts, then buffer object names (what `glGenBuffers (...)` returns) will have a shared pool. VAOs, FBOs and other parts of the state machine are ***not*** shared though - basically only objects that actually store data like buffer objects, textures and GLSL programs are shared. Also, if you modify those buffers in other threads while they are being used by draw commands that have not finished yet, you are going to measurably hurt your performance. – Andon M. Coleman May 20 '14 at 17:09
  • @AndonM.Coleman So VBOs are shared? As for why: I thought one thread could push data to VBOs while the other renders previously pushed data. But now I have doubts about this: http://stackoverflow.com/questions/11097170/multithreaded-rendering-on-opengl?rq=1 It says there that it all parallelizes automatically, even if you have some kind of SLI. Am I right? – tower120 May 20 '14 at 17:22
  • 1
    Yes, VBOs are shared ***if*** you set up your render contexts with sharing (the default is not to share between contexts). This has nothing to do with the number of GPUs you have. If you have a draw command queued in the pipeline, you are not allowed to modify the data it uses until it is finished, or you would produce invalid results. So the driver either makes a copy of your buffer or stalls the pipeline to prevent this situation. – Andon M. Coleman May 20 '14 at 17:25
  • Andon M. Coleman I'm not trying to draw a VBO while the parallel thread is still uploading it. It is something like this: thread 1 draws VBO1 while thread 2 pushes VBO2; then thread 1 draws VBO2 while thread 2 pushes VBO3. – tower120 May 20 '14 at 17:28
  • In that case, you need fence sync objects to ensure that thread 2 is finished uploading the data before thread 1 tries to draw it. – Andon M. Coleman May 20 '14 at 17:30
  • That alternates per frame: a buffer is pushed in one frame and drawn in the next. – tower120 May 20 '14 at 17:31
  • It does not matter, each context has its own command queue. The only way to ensure that the data upload from thread 2 is complete before thread 1 tries to draw is to insert a fence sync. If this were all done in one context, normal command serialization would prevent this situation. You could also use `glFinish (...)` in thread 2, and signal thread 1 when that finishes, but that adds unnecessary CPU overhead that fence syncs can avoid. – Andon M. Coleman May 20 '14 at 17:33
  • @AndonM.Coleman Thank you, I'll keep this in mind (about glFenceSync). And if I have only one GPU, will this technique (one thread pushes, another draws) give no performance improvement? Or does it still make sense: the ALU writes to memory while the shader processors draw the scene? :) – tower120 May 20 '14 at 17:39
  • 1
    In the right conditions this can improve performance no matter how many GPUs you use, particularly if you can handle situations where the data uploaded in thread 2 may arrive 1 frame late. But there are some extra synchronization concerns ***you*** have to take care of yourself when you start using multiple contexts for this. In a single context, the driver automatically orders all of the commands correctly, but in multiple contexts it does not and you need fence syncs or `glFinish (...)` and CPU signaling for that to happen. – Andon M. Coleman May 20 '14 at 17:50

2 Answers


You do not want to use several contexts with several threads. You really don't.

While this sounds like a good idea, in practice the multi-context, multi-thread approach is complicated, troublesome, and badly supported on the driver side, and it only marginally improves (possibly even reduces!) performance.

What you really want is to have only one thread talk to OpenGL (with one context, obviously), map a buffer, and pass the memory pointer to another thread, preferably using 3 buffers (3 subregions of one 3x-sized buffer) with immutable storage and persistent mapping, if this is available (glBufferStorage, OpenGL 4.4 or ARB_buffer_storage).
That, and doing indirect render calls, where a second thread fills the buffers that the indirect call reads from.

Further info on persistent mapping: see in particular slides 22-25 of this GDC2014 presentation, which is basically a remake of Cass Everitt's 2013 SIGGRAPH talk.
See also Everitt's original talk: Beyond Porting.

Damon
  • Why 3 (three) buffers? – tower120 May 21 '14 at 10:53
  • 2
    Well, actually you don't. You use _one_ that is _three times as big_ (which is effectively 3 buffers, but in one buffer object). You must have at least two buffers (or buffer "subregions") since you are writing while the GPU is reading, so you must be careful not to stomp over data while it's used. Using 3 buffers instead of 2 means that you will most likely never (or very rarely) block on your fence, whereas if you only used 2, you would regularly block, almost every time (losing time). – Damon May 21 '14 at 11:02
  • 1
    Worded differently: With 1 buffer, you must wait for the GPU to finish before you can do _anything_. With 2 buffers, you can write #2 while the GPU reads #1, and you can write #1 while the GPU reads #2. In between you have to wait (stall) so you're sure that you don't corrupt stuff. With 3 buffers, you can write #2 while the GPU reads #1, and continue writing #3 while the GPU maybe still finishes #1 or reads #2. You only need to sync after that, but most likely #1 is already free again by that time. – Damon May 21 '14 at 11:06

VAOs aren't shared, so you need to generate a separate VAO for each object in each context; otherwise the behavior becomes unpredictable and incorrect when one is deleted or a new one is created. This can be a major source of errors. VBOs can be shared (if the contexts are created with sharing), so you only need one VBO per object.

bvs