
I want to ensure that my D3D12_HEAP_TYPE_UPLOAD resource has been uploaded before I use it.

Apparently to do this you call ID3D12Resource::Unmap, ID3D12GraphicsCommandList::Close, ID3D12CommandQueue::ExecuteCommandLists and then ID3D12CommandQueue::Signal.

However, this confuses me. The ID3D12Resource::Unmap call is completely unconnected to the command list and queue, except through the device the resource was created on. But I have multiple command queues per device, so how does it choose which command queue to upload the resource on?

Is this documented anywhere? The only help I can find are comments in the samples.

Tom Huntington
  • See the [DirectX Tool Kit for DX12](https://github.com/microsoft/DirectXTK12), and in particular `ResourceUploadBatch`. – Chuck Walbourn May 03 '21 at 20:56

4 Answers


Once you have copied your data through the mapped pointer, it is immediately available to be consumed by commands. For upload resources there is no need to Unmap at all; you can unmap when you Release the resource or at application shutdown.
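
For illustration, here is a minimal C++ sketch of that pattern, assuming `device` is an existing `ID3D12Device*` and `data1`/`dataSize`/`bufferSize` are your application's data (all names are placeholders):

```cpp
#include <d3d12.h>
#include <wrl/client.h>
#include "d3dx12.h"   // helper structures from the official D3D12 samples
using Microsoft::WRL::ComPtr;

// Create a buffer on an upload heap and map it once, for good.
ComPtr<ID3D12Resource> buffer1;
CD3DX12_HEAP_PROPERTIES heapProps(D3D12_HEAP_TYPE_UPLOAD);
CD3DX12_RESOURCE_DESC bufferDesc = CD3DX12_RESOURCE_DESC::Buffer(bufferSize);
device->CreateCommittedResource(
    &heapProps, D3D12_HEAP_FLAG_NONE, &bufferDesc,
    D3D12_RESOURCE_STATE_GENERIC_READ,  // required initial state on upload heaps
    nullptr, IID_PPV_ARGS(&buffer1));

void* mappedPtr1 = nullptr;
CD3DX12_RANGE readRange(0, 0);          // we never read this resource on the CPU
buffer1->Map(0, &readRange, &mappedPtr1);

// The data is visible to subsequently executed GPU commands right away;
// no Unmap is needed until the resource is released.
memcpy(mappedPtr1, data1, dataSize);
```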

However, it is important to note (especially judging by your comments) that the commands will be executed later on the GPU, so if you plan to reuse that memory you need some synchronization mechanism.

Let's make a simple pseudocode example: you have a buffer called buffer1 (which you already created and mapped), and you have access to its memory via mappedPtr1.

copy data1 to mappedPtr1
call compute shader in commandList
execute CommandList

Now everything will execute properly (for one frame, assuming you have synchronization)

Now if you do the following :

copy data1 to mappedPtr1
call compute shader in commandList (1)
copy data2 to mappedPtr1
call compute shader in commandList (1)
execute CommandList

In that case, since you copied data2 to the same place as data1, the first compute shader call will also use data2 (as it is the latest data in that memory by the time the command list actually runs on the GPU)

Now let's have a slightly different example :

copy data1 to mappedPtr1
call compute shader in commandList1
execute CommandList1
copy data2 to mappedPtr1
call compute shader in commandList2
execute CommandList2

What happens now is undefined, since you do not know when CommandList1 and CommandList2 will actually be processed.

If CommandList1 happens to be processed (fast enough) before:

copy data2 to mappedPtr1

then data1 will still be in that memory and will be used.

However, if your command list is a bit heavier and CommandList1 has not yet been processed by the time you finish your call to

copy data2 to mappedPtr1

which is likely to happen, then both compute calls will again use data2 when they run on the GPU.

This is because ExecuteCommandLists is a non-blocking function: when it returns, it only means that your commands have been queued for execution, not that they have been processed.

In order to guarantee that you use the correct data at the correct time, you have several options in that case:

1/ Use a fence and wait for completion

copy data1 to mappedPtr1
call compute shader in commandList1
execute CommandList1 on commandQueue
attachSignal (1) to commandQueue 
add a waitevent for value (1)  
copy data2 to mappedPtr1
call compute shader in commandList2
execute CommandList2 on commandQueue
attachSignal (2) to commandQueue 
add a waitevent for value (2)

This is simple but vastly inefficient, since the CPU now waits for the GPU to finish executing the command list before it can continue any work.
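
A hedged C++ sketch of this fence-and-wait option, assuming `queue`, `fence`, and the Win32 event `fenceEvent` were created beforehand (names are placeholders):

```cpp
UINT64 fenceValue = 1;

memcpy(mappedPtr1, data1, dataSize);
ID3D12CommandList* lists1[] = { commandList1.Get() };
queue->ExecuteCommandLists(1, lists1);
queue->Signal(fence.Get(), fenceValue);   // "attachSignal (1)"

// Block the CPU until the GPU has passed the signal.
if (fence->GetCompletedValue() < fenceValue)
{
    fence->SetEventOnCompletion(fenceValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}
++fenceValue;

// Safe now: the GPU is done reading data1 from mappedPtr1.
memcpy(mappedPtr1, data2, dataSize);
// ... execute commandList2, then Signal(2) and wait again.
```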

2/ Use different resources:

Since you now copy to two different locations, you of course guarantee that the data is different across both calls.

3/ Use a single resource with offsets.

You can also create a larger resource that can hold the data for all of your calls, then copy each piece to its own offset.

I'll assume your data is 64 bytes here (so you would create a 128-byte buffer):

copy data1 to mappedPtr1 (offset 0)
bind address from mappedPtr1 (offset 0) to compute
call compute shader in commandList1
execute CommandList1 on commandQueue 
copy data2 to mappedPtr1 (offset 64)
bind address from mappedPtr1 (offset 64) to compute
call compute shader in commandList2
execute CommandList2 on commandQueue
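
A sketch of the same idea in C++, reusing the `buffer1`/`mappedPtr1` placeholders from earlier; the root parameter index and the use of a root SRV are assumptions about your root signature (note that root CBVs would additionally need 256-byte-aligned offsets):

```cpp
BYTE* cpuBase = static_cast<BYTE*>(mappedPtr1);
D3D12_GPU_VIRTUAL_ADDRESS gpuBase = buffer1->GetGPUVirtualAddress();

memcpy(cpuBase + 0, data1, 64);
commandList1->SetComputeRootShaderResourceView(0, gpuBase + 0);
// ... Dispatch, Close and execute commandList1

memcpy(cpuBase + 64, data2, 64);
commandList2->SetComputeRootShaderResourceView(0, gpuBase + 64);
// ... Dispatch, Close and execute commandList2
```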

Please note that you should still use fences to indicate when a frame has finished processing; this is the only way to guarantee that the upload memory can safely be reused.

If you want to copy the data to a default heap (especially if you do it on a separate copy queue), you will also need a fence on the copy queue and a Wait on the main queue to ensure the copy queue has finished processing and the data is available (in that case you also need, as per the other answer, to set up resource barriers on the default-heap resource).
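
That cross-queue wait can be expressed directly on the queues; here is a sketch, with `copyQueue`, `directQueue`, `copyFence`, and the command-list arrays assumed to exist:

```cpp
// Copy queue: run the upload, then signal its fence.
copyQueue->ExecuteCommandLists(1, copyLists);
copyQueue->Signal(copyFence.Get(), copyFenceValue);

// Direct queue: a GPU-side wait (no CPU stall). Work submitted to
// directQueue after this call will not start until copyFence reaches
// copyFenceValue, i.e. until the copy queue has finished the upload.
directQueue->Wait(copyFence.Get(), copyFenceValue);
directQueue->ExecuteCommandLists(1, drawLists);
```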

Hope it makes sense.

mrvux
  • Thanks, this was helpful. I ended up solving it with solution 2 before you answered, but it's good to get some confirmation that the order of execution is not guaranteed, because other people are saying otherwise: https://stackoverflow.com/questions/52385998/in-dx12-what-ordering-guarantees-do-multiple-executecommandlists-calls-provide – Tom Huntington Jul 31 '21 at 23:38
  • @TomHuntington The order of execution of your command lists (on the GPU) is guaranteed: if you call execute A then execute B, A will always run before B on the GPU side. The issue in your case is that you might overwrite the data in your upload heap before those commands are executed. – mrvux Aug 08 '21 at 14:51
  • " The specification states that commands start execution in-order, but complete out-of-order. Don’t get confused by this. The fact that commands start in-order is simply convenient language to make the spec language easier to write. Unless you add synchronization yourself, all commands in a queue execute out of order." I assume that D3D12 is similar, and believe that this is what was causing my problem. – Tom Huntington Sep 30 '21 at 21:57
  • Yes indeed, but that is not relevant to your question, as it relates to resources you write on the GPU; to control synchronization at that level you also need resource barriers (for example, if both of your compute shaders write to overlapping areas of the same resource, you need a UAV barrier between them). – mrvux Oct 01 '21 at 15:30
  • I've come back to upload synchronization again, and I think this is a great answer. On reflection, I must have assumed that unmapping, using the resource in a command list and executing, then mapping again would give me different memory to write into if the GPU hadn't finished using data1. But it must just give back the same memory, so data1 is overwritten by data2, which is what the GPU ends up using, even though data2 was only written there after the command list was executed. – Tom Huntington Jul 02 '22 at 01:06
  • @TomHuntington Indeed, mapping and unmapping work somewhat like that in D3D11 (it will eventually, though not always, give you a different chunk of memory). Note that in D3D12 you do not need to unmap; it is common for apps to just map on creation and unmap on resource destruction. – mrvux Jul 07 '22 at 12:18

Per Microsoft Docs, all that Map and Unmap do is deal with virtual memory address mapping on the CPU. You can safely leave a resource mapped (i.e. keep it mapped into virtual memory) for a long time, unlike Direct3D 11 where you had to Unmap it.

Almost all the samples use the UpdateSubresources helper in the D3DX12.H utility header. There are a few overloads of it, but they all do the same basic thing:

  • Create/Map an 'intermediate' resource (i.e. something on an upload heap).
  • Take data from the CPU and copy it into the 'intermediate' resource (unmapping it when complete since there's no need to keep the virtual memory address assignment around).
  • Then call CopyBufferRegion or CopyTextureRegion on a command-list (which can be a graphics queue command-list, a copy queue command-list, or a compute-queue command-list).

You can post as many of these into a command-list as you want, but the 'intermediate' resources must remain valid until the copies complete.

As with most things in Direct3D 12, you do this with a fence. When that fence is complete, you know you can release the 'intermediate' resources. Also, none of the copies will actually start until after you close and submit the command-list for execution.

You also need to transition the final resource from a copy state to a state you can use for rendering. Typically you post these on the same command-list, although there are limitations if you are using copy-queue or compute-queue command-lists.
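
Sketching that whole flow under the same caveats (a direct-queue command list; `defaultBuffer` created on a default heap in the `COPY_DEST` state, `uploadBuffer` on an upload heap; sizes and names are placeholders):

```cpp
D3D12_SUBRESOURCE_DATA srcData = {};
srcData.pData = cpuData;
srcData.RowPitch = dataSize;        // for a buffer, pitch == total size
srcData.SlicePitch = dataSize;

// Records the CopyBufferRegion from the 'intermediate' upload resource.
UpdateSubresources(commandList.Get(), defaultBuffer.Get(), uploadBuffer.Get(),
                   0, 0, 1, &srcData);

// Transition the destination from the copy state to a usable state.
CD3DX12_RESOURCE_BARRIER barrier = CD3DX12_RESOURCE_BARRIER::Transition(
    defaultBuffer.Get(),
    D3D12_RESOURCE_STATE_COPY_DEST,
    D3D12_RESOURCE_STATE_VERTEX_AND_CONSTANT_BUFFER);
commandList->ResourceBarrier(1, &barrier);

commandList->Close();
ID3D12CommandList* lists[] = { commandList.Get() };
queue->ExecuteCommandLists(1, lists);
queue->Signal(fence.Get(), ++fenceValue);
// Only release uploadBuffer once fence->GetCompletedValue() >= fenceValue.
```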

For a full implementation of this, see the [DirectX Tool Kit for DX12](https://github.com/microsoft/DirectXTK12).

Note that it is possible to render a texture or use vertex/index buffers directly from the upload heap. It's not as efficient as copying into a default heap, but it is akin to Direct3D 11 USAGE_DYNAMIC. In that case, it makes sense to keep the upload heap "mapped", and only re-use the same address once you know it's no longer in use; otherwise, corruption or other bad things can happen.
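
For example, drawing straight from an upload-heap vertex buffer might look like this sketch (`uploadVB`/`mappedVB` being a mapped upload resource as above, `Vertex` a placeholder type):

```cpp
memcpy(mappedVB, vertices, vertexDataSize);   // only once the GPU is done with it

D3D12_VERTEX_BUFFER_VIEW vbv = {};
vbv.BufferLocation = uploadVB->GetGPUVirtualAddress();
vbv.SizeInBytes    = vertexDataSize;
vbv.StrideInBytes  = sizeof(Vertex);

commandList->IASetVertexBuffers(0, 1, &vbv);
commandList->DrawInstanced(vertexCount, 1, 0, 0);
```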

Chuck Walbourn
  • Sorry I was unclear. I was trying to read an upload heap from a compute shader once per frame, but some frames were receiving the data of previous frames. Maybe I should just do an explicit copy to a default heap – Tom Huntington May 04 '21 at 21:51
  • It only happens when I call ID3D12CommandQueue::ExecuteCommandLists multiple times per frame. – Tom Huntington May 04 '21 at 22:05

According to this article from NVIDIA, an upload buffer is not copied until the GPU needs it: right before a draw (or copy) call is executed, any upload buffers used by the call are uploaded to GPU RAM.

This means three things:

  1. It is rather simple to know when you can execute the draw call. Just ensure that the memcpy call has returned before executing the command list.
  2. It is a bit more complicated to know when the draw call has uploaded the buffer, i.e. when you can change the buffer for the next frame. Here a fence is needed to get that info back from the GPU.
  3. Since the upload is done for every draw call, only use an upload buffer if the data changes between every draw call. Otherwise optimize the rendering process by copying the upload buffer into a GPU bound buffer.
UltraPanic
  • It may also be a good idea to pre-upload the data to the GPU, so your draw call does not have to wait for the upload to happen. – Tom Huntington Sep 20 '22 at 21:21
  • Only if you know you have a period of time where the GPU would otherwise be idle. Then you could use that time to copy the buffer into a GPU bound buffer. – UltraPanic Sep 20 '22 at 21:38

Just summarising the mental model:

D3D12_HEAP_TYPE_UPLOAD and D3D12_HEAP_TYPE_READBACK resources have no (stateful) GPU backing memory, only CPU memory. The upload/readback happens every time they are used, usually via CopyResource/CopyBufferRegion/CopyTextureRegion, and (in the upload case) whatever state the mapped CPU memory is in when this operation occurs is what you get on the GPU.

The upload and the copy are simultaneous, and a new upload occurs for each copy.

However, as GPU operations are asynchronous, you have to use synchronization primitives to ensure that the mapped CPU memory is in the right state when the GPU upload-copy operation occurs.

In my case, this means making sure I don't overwrite the current data with future data before the GPU upload-copy operation completes.

The typical usage pattern is to have a ring buffer of D3D12_HEAP_TYPE_UPLOAD resources. On each iteration of the render loop, the next resource in the ring buffer gets copied into the same D3D12_HEAP_TYPE_DEFAULT resource. Edit: this is unsafe when frame buffering, and I believe it was the original bug I had. @mrvux described a very real problem, just not the one I was having.
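
A sketch of a safe variant of that ring buffer, gating each slot's reuse on a per-slot fence value (all setup names — `fence`, `fenceEvent`, `queue`, the slot contents — are placeholders you create elsewhere):

```cpp
constexpr UINT kFrameCount = 3;   // ring size: how many frames may be in flight

struct FrameSlot
{
    ComPtr<ID3D12Resource> upload;      // persistently mapped upload buffer
    void*                  mapped = nullptr;
    UINT64                 fenceValue = 0;
};
FrameSlot slots[kFrameCount];
UINT frameIndex = 0;
UINT64 nextFenceValue = 1;

// Each iteration of the render loop:
FrameSlot& slot = slots[frameIndex];

// Block only if the GPU has not finished the frame that last used this slot.
if (fence->GetCompletedValue() < slot.fenceValue)
{
    fence->SetEventOnCompletion(slot.fenceValue, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
}

memcpy(slot.mapped, frameData, frameDataSize);  // safe to overwrite now
// ... record the copy into the DEFAULT-heap resource and execute, then:
queue->Signal(fence.Get(), nextFenceValue);
slot.fenceValue = nextFenceValue++;
frameIndex = (frameIndex + 1) % kFrameCount;
```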

Tom Huntington