What is actually a Queue family in Vulkan?

Question

I am currently learning Vulkan. Right now I am just taking apart each command and inspecting the structures to try to understand what they mean.

Right now I am analyzing QueueFamilies, for which I have the following code:

vector<vk::QueueFamilyProperties> queue_families = device.getQueueFamilyProperties();
for(auto &q_family : queue_families)
{
    cout << "Queue number: "  + to_string(q_family.queueCount) << endl;
    cout << "Queue flags: " + to_string(q_family.queueFlags) << endl;
}

This produces this output:

Queue number: 16
Queue flags: {Graphics | Compute | Transfer | SparseBinding}
Queue number: 1
Queue flags: {Transfer}
Queue number: 8
Queue flags: {Compute}

So, naively I am understanding this like this:

There are 3 queue families. One queue family has 16 queues, all capable of graphics, compute, transfer and sparse binding operations (no idea what the last 2 are).

Another has 1 queue, capable only of transfer (whatever that is).

And the final one has 8 queues capable of compute operations.

What is each queue family? I understand it's where we send execution commands like drawing and swapping buffers, but this is a somewhat broad explanation. I would like a more knowledgeable answer with more details.

What are the 2 extra flags, Transfer and SparseBinding?

And finally, why do we have/need multiple command queues?

See http://asawicki.info/news_1698_vulkan_sparse_binding_-_a_quick_overview.html for an explanation of sparse binding. Basically, it is a way to use paging on the GPU like you would on the CPU, instead of having to bind all of a resource at once. It lets you move memory around to avoid fragmentation instead of having to recreate all of your resources. — Krupip, Mar 22 '19 at 13:41

score 68 · Accepted Answer · answered Mar 21 '19 at 04:14

To understand queue families, you first have to understand queues.

A queue is something you submit command buffers to, and command buffers submitted to a queue are executed in order[*1] relative to each other. Command buffers submitted to different queues are unordered relative to each other unless you explicitly synchronize them with VkSemaphore. You can only submit work to a queue from one thread at a time, but different threads can submit work to different queues simultaneously.
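
The one-thread-at-a-time rule is usually met by putting a lock around submission. Here is a minimal sketch, assuming a hypothetical `FakeQueue` stand-in and a hypothetical `LockedQueue` wrapper so that it compiles without the Vulkan headers. In real code the locked call would be `vkQueueSubmit`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a VkQueue: it only records the order in which
// command-buffer IDs were submitted.
struct FakeQueue {
    std::vector<int> submitted;
};

// Hypothetical wrapper that serializes access to one queue, so that only one
// thread at a time can be inside submit(). This is the caller-side locking
// that Vulkan's "external synchronization" requirement leaves to the application.
class LockedQueue {
public:
    // In a real application this is where vkQueueSubmit would be called.
    void submit(int commandBufferId) {
        assert(commandBufferId >= 0);  // IDs are non-negative in this sketch
        std::lock_guard<std::mutex> guard(mutex_);
        queue_.submitted.push_back(commandBufferId);
    }

    // Number of submissions recorded so far.
    std::size_t submissionCount() {
        std::lock_guard<std::mutex> guard(mutex_);
        return queue_.submitted.size();
    }

private:
    std::mutex mutex_;
    FakeQueue queue_;
};
```

Several threads can then share one `LockedQueue` and call `submit` freely. Threads that each own a different queue need no lock at all.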

Each queue can only perform certain kinds of operations. Graphics queues can run graphics pipelines started by vkCmdDraw* commands. Compute queues can run compute pipelines started by vkCmdDispatch*. Transfer queues can perform transfer (copy) operations from vkCmdCopy*. Sparse binding queues can change the binding of sparse resources to memory with vkQueueBindSparse (note this is an operation submitted directly to a queue, not a command in a command buffer). Some queues can perform multiple kinds of operations. In the spec, every command that can be submitted to a queue has a "Command Properties" table that lists what queue types can execute the command.
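
The capability flags are a plain bitmask, so checking what a family can do is a bitwise test. A small sketch, assuming stand-in constants that mirror the values of the `VkQueueFlagBits` enumerators so it compiles without the Vulkan headers. The helper names `supportsAll` and `describe` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-ins mirroring VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT,
// VK_QUEUE_TRANSFER_BIT and VK_QUEUE_SPARSE_BINDING_BIT.
enum QueueCapability : std::uint32_t {
    kGraphics      = 0x1,
    kCompute       = 0x2,
    kTransfer      = 0x4,
    kSparseBinding = 0x8,
};

// True if every capability bit in `required` is also set in `familyFlags`.
inline bool supportsAll(std::uint32_t familyFlags, std::uint32_t required) {
    assert(required != 0);  // asking for "nothing" is most likely a caller bug
    return (familyFlags & required) == required;
}

struct CapabilityName {
    std::uint32_t bit;
    const char*   name;
};

// Human-readable list of the capabilities set in `flags`,
// joined with " | " in the order Graphics, Compute, Transfer, SparseBinding.
inline std::string describe(std::uint32_t flags) {
    const CapabilityName table[] = {
        {kGraphics, "Graphics"},
        {kCompute, "Compute"},
        {kTransfer, "Transfer"},
        {kSparseBinding, "SparseBinding"},
    };
    std::string out;
    for (const CapabilityName& entry : table) {
        if (flags & entry.bit) {
            if (!out.empty()) out += " | ";
            out += entry.name;
        }
    }
    return out;
}
```

With the real headers the same test would be written against `VkQueueFamilyProperties::queueFlags`.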

A queue family just describes a set of queues with identical properties. So in your example, the device supports three kinds of queues:

One kind can do graphics, compute, transfer, and sparse binding operations, and you can create up to 16 queues of that type.
Another kind can only do transfer operations, and you can only create one queue of this kind. Usually this is for asynchronously DMAing data between host and device memory on discrete GPUs, so transfers can be done concurrently with independent graphics/compute operations.
Finally, you can create up to 8 queues that are only capable of compute operations.

Some queues might only correspond to separate queues in the host-side scheduler, other queues might correspond to actual independent queues in hardware. For example, many GPUs only have one hardware graphics queue, so even if you create two VkQueues from a graphics-capable queue family, command buffers submitted to those queues will progress through the kernel driver's command buffer scheduler independently, but will execute in some serial order on the GPU. But some GPUs have multiple compute-only hardware queues, so two VkQueues for a compute-only queue family might actually proceed independently and concurrently all the way through the GPU. Vulkan doesn't expose this.

Bottom line, decide how many queues you can usefully use, based on how much concurrency you have. For many apps, a single "universal" queue is all they need. More advanced ones might have one graphics+compute queue, a separate compute-only queue for asynchronous compute work, and a transfer queue for async DMA. Then map what you'd like onto what's available; you may need to do your own multiplexing, e.g. on a device that doesn't have a compute-only queue family, you might create multiple graphics+compute queues instead, or serialize your async compute jobs onto your single graphics+compute queue yourself.
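
One common way to do the "map what you'd like onto what's available" step is to pick, for each role, the family that has the required capabilities and the fewest others, and to fall back to the universal family when nothing more specialized exists. This is a sketch of that heuristic, not something the spec prescribes. `QueueFamily`, `pickFamily` and `exampleDevice` are hypothetical names, and the flag constants are stand-ins mirroring the `VkQueueFlagBits` values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the VkQueueFlagBits values.
enum : std::uint32_t { kGraphics = 0x1, kCompute = 0x2, kTransfer = 0x4, kSparseBinding = 0x8 };

// Hypothetical, trimmed-down analogue of VkQueueFamilyProperties.
struct QueueFamily {
    std::uint32_t flags;
    std::uint32_t queueCount;
};

const int kNoFamily = -1;

// Number of set bits in `flags`.
inline int countBits(std::uint32_t flags) {
    int n = 0;
    for (; flags != 0; flags &= flags - 1) ++n;
    return n;
}

// Returns the index of the family that supports all of `required` while
// advertising the fewest capabilities overall (the most specialized match),
// or kNoFamily if no family qualifies. Ties go to the lowest index.
inline int pickFamily(const std::vector<QueueFamily>& families, std::uint32_t required) {
    assert(required != 0);
    int best = kNoFamily;
    for (std::size_t i = 0; i < families.size(); ++i) {
        const QueueFamily& f = families[i];
        if (f.queueCount == 0 || (f.flags & required) != required) continue;
        if (best == kNoFamily ||
            countBits(f.flags) < countBits(families[static_cast<std::size_t>(best)].flags)) {
            best = static_cast<int>(i);
        }
    }
    return best;
}

// The three families reported in the question.
inline std::vector<QueueFamily> exampleDevice() {
    return {
        {kGraphics | kCompute | kTransfer | kSparseBinding, 16},
        {kTransfer, 1},
        {kCompute, 8},
    };
}
```

On the device from the question this picks family 1 for async transfers, family 2 for async compute and family 0 for graphics. On a device that exposes only the universal family, every request resolves to family 0 and the multiplexing has to happen in the application.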

[*1] Oversimplifying a bit. They start in order, but are allowed to proceed independently after that and complete out of order. Independent progress of different queues is not guaranteed though. I'll leave it at that for this question.

"You can only submit work to a queue from one thread at a time" I am understanding that as, 2 threads can submit commands to the same queue but it needs to be synchronized so that only one of them does it at a time. Is that correct? — Makogan, Mar 21 '19 at 13:51
Yes, that's right. Vulkan calls this "external synchronization", and many objects are externally synchronized like this, which just means that you can't have two threads operating on the object simultaneously (unless all are read-only operations). — Jesse Hall, Mar 22 '19 at 02:20
I thought that was a little out of scope. But sparse binding operations are basically "update page tables to map region [X,X'] of an image or buffer to bytes [Y,Y'] of memory". They're queued and subject to queue synchronization so that it's easier to make them happen in the proper order relative to graphics/compute/transfer operations without heavier-weight "wait on fence, do operation, then submit dependent work" synchronization. — Jesse Hall, Mar 22 '19 at 14:39
Given the example posted by the OP (16 general purpose queues (including transfer ops), but 1 transfer queue), is there any advantage to using the more restrictive queue? — Charlie Su, Oct 27 '20 at 03:47

krOoze · Answer 2 · 2019-07-24T12:53:07.503

A Queue is a thing that accepts Command Buffers containing operations of a given type (given by the family flags). The commands submitted to a Queue have a Submission Order, therefore they are subject to synchronization by Pipeline Barriers, Subpass Dependencies, and Events (while across queues a Semaphore or something heavier has to be used).

There's one trick: COMPUTE and GRAPHICS queues can always implicitly accept TRANSFER workloads, even if the QueueFamilyProperties do not list it. See the Note below the specification of VkQueueFlagBits.
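
In code, that rule means a check for transfer support should not look at the TRANSFER bit alone. A tiny sketch, with stand-in constants mirroring the `VkQueueFlagBits` values and hypothetical helper names:

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins mirroring VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and VK_QUEUE_TRANSFER_BIT.
enum : std::uint32_t { kGraphics = 0x1, kCompute = 0x2, kTransfer = 0x4 };

// True if a family with these flags can execute transfer commands:
// either it lists TRANSFER explicitly, or it lists GRAPHICS or COMPUTE,
// which imply transfer support.
inline bool canTransfer(std::uint32_t familyFlags) {
    return (familyFlags & (kGraphics | kCompute | kTransfer)) != 0;
}

// Returns `familyFlags` with the implied TRANSFER bit made explicit.
inline std::uint32_t withImpliedTransfer(std::uint32_t familyFlags) {
    return canTransfer(familyFlags) ? (familyFlags | kTransfer) : familyFlags;
}
```

Normalizing the flags once with something like `withImpliedTransfer` keeps later capability checks simple.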

Transfer is for Copy commands (note that vkCmdBlitImage is listed as requiring a graphics-capable queue). Sparse is something like paging: it allows binding multiple Memory handles to a single Image, and re-binding different memory later.

In the Specification, the description of each vkCmd* command is followed by a table listing its "Supported Queue Types".

A Queue Family is a group of Queues that are closely related to each other. Some things are restricted to a single Queue Family, such as Images (they have to be transferred between Queue Families) or a Command Pool (it creates Command Buffers only for consumption by the given Queue Family and no other). Theoretically, on some exotic device there could be multiple Queue Families with the same Flags.
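
For resources, the practical consequence is the sharing-mode decision at creation time: if all queues that touch an image or buffer come from one family, exclusive ownership needs no transfers, otherwise you either transfer ownership explicitly or create the resource as concurrent across the distinct families. A sketch of that decision, assuming a stand-in `SharingMode` enum in place of `VkSharingMode` and hypothetical names `SharingInfo` and `chooseSharing`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for VkSharingMode (VK_SHARING_MODE_EXCLUSIVE / VK_SHARING_MODE_CONCURRENT).
enum class SharingMode { Exclusive, Concurrent };

struct SharingInfo {
    SharingMode mode;
    std::vector<std::uint32_t> familyIndices;  // filled only for Concurrent
};

// Takes the family index of every queue that will access the resource
// (duplicates allowed) and decides how the resource should be created.
inline SharingInfo chooseSharing(std::vector<std::uint32_t> families) {
    assert(!families.empty());
    // Concurrent sharing expects each family index to be listed once.
    std::sort(families.begin(), families.end());
    families.erase(std::unique(families.begin(), families.end()), families.end());

    SharingInfo info;
    if (families.size() > 1) {
        info.mode = SharingMode::Concurrent;
        info.familyIndices = families;
    } else {
        info.mode = SharingMode::Exclusive;
    }
    return info;
}
```

This sketch always prefers concurrent sharing when more than one family is involved. Exclusive mode with explicit ownership transfers is the alternative and may perform better, at the cost of extra barriers.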

That's pretty much everything the Vulkan Specification guarantees. See an Issue with this at KhronosGroup/Vulkan-Docs#569

There are some vendor-specific materials given, e.g.:

AMD's Leveraging asynchronous queues for concurrent execution
NVIDIA's Moving to Vulkan: Asynchronous compute

The GPUs have asynchronous Graphics Engine(s), Compute Engine(s), and Copy/DMA Engine(s). The Graphics and Compute engines would of course compete for the same Compute Units of the GPU.

They usually have only one Graphics Frontend. That is a bottleneck for graphics operations, so that means there's no point in using more than one Graphics Queue.

There are two modes of operation for Compute: Synchronous Compute (exposed as the GRAPHICS|COMPUTE family) and Async Compute (exposed as the COMPUTE-only family). The first is a safe choice. The second can give you about 10% more performance, but is trickier and requires more effort. The AMD article suggests always implementing the first as a baseline.

There can theoretically be as many Compute Queues as there are Compute Units on the GPU. But AMD argues there's no benefit to more than two Async Compute Queues and exposes that many. NVIDIA seems to go with the full number.

The Copy/DMA Engines (exposed as the TRANSFER-only family) are primarily intended for CPU⇄GPU transfers. They would usually not achieve full throughput for an inside-GPU copy. So unless there's some driver magic, the Async Transfer Family should be used for CPU⇄GPU transfers (to reap the Async property, being able to do Graphics next to it unhindered). For inside-GPU copies it should be better in most cases to use the GRAPHICS|TRANSFER family.

**How does a present queue family fit into this?** From what I saw, most people behave as if it was almost certain that it always is the graphics queue family. The exception is the various online tutorials that most often assume, that on the contrary, they are different, though I think that they do it for educational purposes - to show how to use synchronization capabilities. — janekb04, Jul 12 '20 at 20:37
@enthusiastic_3d_graphics_pr... Yes, it is almost certain there will be GRAPHICS+COMPUTE+PRESENT queue family. It is only a theoretical possibility though the present is not supported on that queue (if it is supported at all; Vulkan **does** allow headless, and compute-only implementations); I even [suggested removing that possibility](https://github.com/KhronosGroup/Vulkan-Docs/issues/1234), because I feel like everyone idly talks about the case every other day, while it does not even exist in real HW. — krOoze, Jul 12 '20 at 23:13
@enthusiastic_3d_graphics_pr... PS: Decent compromise feels to just use `VK_SHARING_MODE_CONCURRENT` for the (nonexistent) case; it is relatively unobtrusive and almost the same as if the queues were the same. Technically might be less performant, but who cares for non-existent case. Probably would need to generally reoptimize anyway for such hypothetical exotic HW, if and when someone makes it. — krOoze, Jul 12 '20 at 23:24
