
On my machine I have two queue families, one that supports everything and one that only supports transfer.

The queue family that supports everything has a queueCount of 16.

Now the spec states:

"Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another."

Does that mean I should try to use all available queues for maximal performance?
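For reference, this is the setup reported by vkGetPhysicalDeviceQueueFamilyProperties; a minimal sketch of the query, assuming the VkPhysicalDevice has already been picked from the instance:

```c
#include <vulkan/vulkan.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal sketch: list each queue family, its queueCount and capability bits.
 * Assumes `gpu` was obtained via vkEnumeratePhysicalDevices() beforehand. */
void print_queue_families(VkPhysicalDevice gpu)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, NULL);

    VkQueueFamilyProperties *props =
        malloc(count * sizeof(VkQueueFamilyProperties));
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, props);

    for (uint32_t i = 0; i < count; ++i) {
        printf("family %u: queueCount=%u graphics=%d compute=%d transfer=%d\n",
               i, props[i].queueCount,
               (props[i].queueFlags & VK_QUEUE_GRAPHICS_BIT) != 0,
               (props[i].queueFlags & VK_QUEUE_COMPUTE_BIT) != 0,
               (props[i].queueFlags & VK_QUEUE_TRANSFER_BIT) != 0);
    }
    free(props);
}
```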

Maik Klein

4 Answers


Yes, if you have workloads that are highly independent, use separate queues.

If the queues need a lot of synchronization between them, it may kill any potential benefit you would otherwise get.

In the case of queues from the same family, what you are basically doing is supplying the GPU with alternative work it can use to fill stalls, bubbles, and idle time, and giving the GPU the choice. There is also some potential to make better use of the CPU (e.g. single-threaded submission vs. one queue per thread).

Using separate transfer queues (or queues from another specialized family) even seems to be the recommended approach.

That is generally speaking. A more realistic, empirical, sceptical and practical view has already been presented in the answers by SW and NB. In reality one does have to be a bit more cautious, as those queues target the same resources, share the same limits, and have other common restrictions, which limits the potential benefit gained from this. Notably, if the driver does the wrong thing with multiple queues, it may be very, very bad for the cache.

AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their hardware and driver. It shows the potential benefits of using separate queue families. It says that although they expose two queues of the compute family, they did not observe benefits in applications at that time. They also explain why they offer only one graphics queue.

NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.

To be safe, on current hardware it seems we should still stick with only one graphics queue and one async compute queue. Sixteen queues seem like a trap and a way to hurt yourself.

Transfer queues are not as simple as they seem either: the dedicated ones should be used for host-to-device transfers, while the non-dedicated ones should be used for device-to-device transfer operations.
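A minimal sketch of the queue family selection this suggests (hypothetical helper names; assumes the standard Vulkan C API): prefer a transfer-only family for staging, a compute-only family for async compute, and fall back to the graphics family where a specialized family is absent.

```c
#include <vulkan/vulkan.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: find a queue family whose flags contain `required`
 * and contain none of `forbidden`. Returns the family index or UINT32_MAX. */
static uint32_t find_family(VkPhysicalDevice gpu,
                            VkQueueFlags required, VkQueueFlags forbidden)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, NULL);
    VkQueueFamilyProperties *props =
        malloc(count * sizeof(VkQueueFamilyProperties));
    vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, props);

    uint32_t found = UINT32_MAX;
    for (uint32_t i = 0; i < count; ++i) {
        if ((props[i].queueFlags & required) == required &&
            (props[i].queueFlags & forbidden) == 0) {
            found = i;
            break;
        }
    }
    free(props);
    return found;
}

/* One graphics queue, one async-compute queue (compute-only family if
 * available), one dedicated transfer queue for host->device staging. */
void pick_queue_families(VkPhysicalDevice gpu)
{
    uint32_t graphics = find_family(gpu, VK_QUEUE_GRAPHICS_BIT, 0);
    uint32_t compute  = find_family(gpu, VK_QUEUE_COMPUTE_BIT,
                                    VK_QUEUE_GRAPHICS_BIT);
    uint32_t transfer = find_family(gpu, VK_QUEUE_TRANSFER_BIT,
                                    VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT);

    /* Fall back to the graphics family if a specialized family is absent. */
    if (compute == UINT32_MAX)  compute  = graphics;
    if (transfer == UINT32_MAX) transfer = graphics;
    (void)graphics; (void)compute; (void)transfer;
}
```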

krOoze
  • That's true. If you're doing lots of transfers (e.g. staging buffers), especially at run time, it's recommended to use a dedicated transfer queue. With recent drivers, NVIDIA and AMD GPUs should offer queue families that only (or primarily) support transfers. – Sascha Willems Jun 01 '16 at 20:57
  • Should? I thought they already did. At least AMD long since beta drivers... And I think I have it from some NVIDIA guy speech. – krOoze Jun 01 '16 at 21:03
  • "*Yes, think of them as CPU cores in this respect.*" I think that analogy, for queues of the same family, gives the wrong impression. It makes it seem like, if a GPU provides 16 separate graphics cores, and you're only submitting work to 1 of them, then you're only using 1/16th of your GPU's hardware. That seems decidedly unlikely. For queues of *different* families, that analogy makes sense. They represent distinct hardware. Queues within the same family don't necessarily do so. – Nicol Bolas Jun 01 '16 at 22:05
  • @NicolBolas True. Analogies are evil this way. What I meant is, the following two sentences are true for both. Am gonna delete that, it's a redundant statement anyway... – krOoze Jun 01 '16 at 22:25
  • @krOoze: _Should? I thought they already did. At least AMD long since beta drivers_ For AMD yes, other vendors exposed the dedicated transfer queue family with newer drivers. If you compare these (release and beta) drivers for a GTX 980 e.g. : http://vulkan.gpuinfo.org/compare.php?compare=compare&id%5B427%5D=on&id%5B40%5D=on – Sascha Willems Jun 02 '16 at 06:15
  • @SaschaWillems NVIDIA supports it from April 8 (according to the changelog). By "recent" I got a different impression (like from this week or something). – krOoze Jun 02 '16 at 12:02
  • This doesn't seem right. You don't need more queues to send more independent work to the device. You can cram them all together in the same queue, as @NicolBolas pointed out in the comments of his answer, and it is enough to keep the device busy. – lvella Jul 17 '18 at 21:35
  • PS: There is certainly some API design vs. reality problem. Asynchronous work submission does not work as well in reality as it does on paper (bad drivers not helping). I did not want to get into that. It is a matter of taking a hint from the driver. If it offers N queues, it is trying to say: "There should be a non-negative benefit to using our N queues, as long as you do not try to forcibly bend your workload to fit that model". I am gonna add AMD and NVIDIA material giving hints on what is going on, though it would be awesome to have an answer here (in clear terms) from an actual driver maker. – krOoze Jul 19 '18 at 13:20

To what end?

Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.

How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.

And that combination step would require synchronization.

Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?

Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.

Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
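As a sketch of that scaling concern (hypothetical numbers, not a recommendation): request only the queues you actually plan to keep busy, clamped to the family's queueCount, which may be 1.

```c
#include <vulkan/vulkan.h>

/* Hypothetical example: request at most 2 queues from `family_index`,
 * but never more than the family actually exposes (which may be 1). */
VkDevice create_device(VkPhysicalDevice gpu, uint32_t family_index,
                       uint32_t family_queue_count)
{
    uint32_t desired = 2;                      /* what we plan to keep busy */
    uint32_t count = desired < family_queue_count ? desired
                                                  : family_queue_count;

    float priorities[2] = { 1.0f, 1.0f };

    VkDeviceQueueCreateInfo queue_info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
        .queueFamilyIndex = family_index,
        .queueCount = count,
        .pQueuePriorities = priorities,
    };

    VkDeviceCreateInfo device_info = {
        .sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO,
        .queueCreateInfoCount = 1,
        .pQueueCreateInfos = &queue_info,
    };

    VkDevice device = VK_NULL_HANDLE;
    vkCreateDevice(gpu, &device_info, NULL, &device);
    return device;
}
```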

Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.

Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:

  1. Having multiple queues executing genuine rendering operations isn't helpful.
  2. Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
Nicol Bolas
  • Do you know of any device where asking for queues but not using them severely impacts performance? It seems to me that if this is true, then you should always use just one queue, unless you can keep multiple queues busy at all times. – Quinchilion Jun 03 '16 at 21:38
  • I don't believe `vkGetDeviceQueue()` is a state-altering call, because there is no corresponding `put` call. It doesn't seem reasonable that retrieving more queues than what is used will impact the performance. – lvella Jul 17 '18 at 20:36
  • You assumed the OP is using Vulkan for interactive graphics rendering, but compute workloads can be parallel enough to fill up all the queues (imagine the OP has to process a large batch of independent data, possibly using the graphics pipeline as a helper), so this answer doesn't really match the question. – lvella Jul 17 '18 at 20:46
  • @lvella: I said nothing about `vkGetDeviceQueue()`'s performance. I was talking about how many queues you request at device *creation* time. As for the OP's workload, the OP neglected to specify it; I can't comment on information the OP didn't provide. Also, what reason would you have to expect that submitting a bunch of independent operations would be faster than submitting them one after the other (in a single submission call)? – Nicol Bolas Jul 17 '18 at 21:12
  • I don't expect it. I am trying to figure this out. But your answer avoided the problem by saying there is no use for many simultaneous queues. – lvella Jul 17 '18 at 21:19
  • @lvella: "*saying there is no use for many simultaneous queues*" I did? Because I'm pretty sure I said "there are many tasks which can be parallelized", followed by a list of several possibilities. That seems to be explicitly saying that there may be a use for it. – Nicol Bolas Jul 17 '18 at 21:21
  • The idea that asking for more queues at device creation time than you end up using in practice would have that dramatic an effect on performance seems really speculative to me without actually citing vendor documentation or realistic benchmarks showing this effect. I would guess (equally speculatively) that a well-written graphics driver wouldn't behave that way, since some applications (e.g. a general-purpose game engine) might reasonably ask for more queues than always end up seeing use on every run. I think a reader would need stronger evidence to know which guess is right and how far. – Zoë Sparks May 11 '22 at 21:45

That strongly depends on your actual scenario and setup. It's hard to tell without any details.

If you submit command buffers to multiple queues, you also need to do proper synchronization, and if that's not done right you may actually get worse performance than just using one queue.

Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out of order (aka "in flight"); see chapter 2.2 of the spec or this AMD presentation for details.

If you do both compute and graphics, using separate queues with simultaneous submissions (and proper synchronization) will improve performance on hardware that supports async compute.
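A minimal sketch of that pattern (hypothetical, already-recorded command buffers; cleanup omitted): the compute submission signals a semaphore that the graphics submission waits on before the stage that consumes the results.

```c
#include <vulkan/vulkan.h>

/* Hypothetical example: submit compute work and graphics work to two
 * different queues, synchronized with a semaphore so the graphics pass
 * only starts consuming the compute results once they are ready.
 * (Semaphore destruction and error handling omitted for brevity.) */
void submit_async_compute(VkDevice device,
                          VkQueue compute_queue, VkQueue graphics_queue,
                          VkCommandBuffer compute_cmd, VkCommandBuffer gfx_cmd)
{
    VkSemaphoreCreateInfo sem_info = {
        .sType = VK_STRUCTURE_TYPE_SEMAPHORE_CREATE_INFO,
    };
    VkSemaphore compute_done = VK_NULL_HANDLE;
    vkCreateSemaphore(device, &sem_info, NULL, &compute_done);

    /* Compute submission: signals `compute_done` when finished. */
    VkSubmitInfo compute_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount = 1,
        .pCommandBuffers = &compute_cmd,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores = &compute_done,
    };
    vkQueueSubmit(compute_queue, 1, &compute_submit, VK_NULL_HANDLE);

    /* Graphics submission: waits for the compute results before the
     * stage that reads them (vertex shader here, as an example). */
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
    VkSubmitInfo gfx_submit = {
        .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores = &compute_done,
        .pWaitDstStageMask = &wait_stage,
        .commandBufferCount = 1,
        .pCommandBuffers = &gfx_cmd,
    };
    vkQueueSubmit(graphics_queue, 1, &gfx_submit, VK_NULL_HANDLE);
}
```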

So there is no definitive yes or no on this without knowing about your actual use case.

Sascha Willems

Since you can submit multiple independent workloads to the same queue, and it doesn't seem there is any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
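As a sketch, priorities are passed per queue at device creation (hypothetical values; they are normalized hints in [0.0, 1.0] and their exact effect is implementation-defined):

```c
#include <vulkan/vulkan.h>

/* Hypothetical example: two queues from the same family with different
 * priorities. The struct below goes into
 * VkDeviceCreateInfo::pQueueCreateInfos as usual. */
float priorities[2] = { 1.0f, 0.5f };       /* queue 0 high, queue 1 lower */

VkDeviceQueueCreateInfo queue_info = {
    .sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO,
    .queueFamilyIndex = 0,                  /* assumed "do everything" family */
    .queueCount = 2,
    .pQueuePriorities = priorities,
};
```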

I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.

lvella