Let's say I have 280 bytes of data. If I create a single buffer, then according to `VkMemoryRequirements` the allocated size should be 512 bytes with an alignment of 512 - that's clear. But I need one big host-visible buffer which can hold 3 such pieces of data (which is better than 3 separate buffers, according to Nvidia). And it's not clear to me - should I specify `VkBufferCreateInfo::size` equal to `280 * 3` or `512 * 3`? If I make it equal to `512 * 3`, it's a waste of space. If I make it equal to `280 * 3`, can I expect problems when mapping the memory? The specification mentions that the mapping range should be a multiple of `VkPhysicalDeviceLimits::nonCoherentAtomSize`, but only for memory that was allocated without the `VK_MEMORY_PROPERTY_HOST_COHERENT_BIT`, which is not my case. Does host-coherent memory guarantee byte-granularity memory updates?

2 Answers
If You want to create one buffer that can hold `3 * 280` bytes of data, then You need to create a buffer that can hold `3 * 280` bytes of data (You need to specify this value as the size during buffer creation). But how much memory it will require (how large a memory object should be) is up to the driver. You need to create a buffer of size equal to `3 * 280`, then check its memory requirements, then allocate the necessary memory object (or sub-allocate from a larger memory object) and bind this memory to the buffer.
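A minimal sketch of that flow (it assumes an existing `device` handle and a hypothetical `findHostVisibleMemoryTypeIndex()` helper that selects a host-visible memory type allowed by `memoryTypeBits`):

```
// Sketch: one buffer holding 3 * 280 bytes, backed by memory sized per the driver's requirements.
VkBufferCreateInfo bufferInfo = {};
bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
bufferInfo.size = 3 * 280;                               // the buffer's own size
bufferInfo.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;   // whatever usage is needed
bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;

VkBuffer buffer = VK_NULL_HANDLE;
vkCreateBuffer(device, &bufferInfo, nullptr, &buffer);

VkMemoryRequirements memReq = {};
vkGetBufferMemoryRequirements(device, buffer, &memReq);  // the driver decides memReq.size

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReq.size;                  // not 3 * 280, but what the driver asked for
allocInfo.memoryTypeIndex = findHostVisibleMemoryTypeIndex(memReq.memoryTypeBits); // hypothetical helper

VkDeviceMemory memory = VK_NULL_HANDLE;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindBufferMemory(device, buffer, memory, 0);           // offset 0 always satisfies the alignment
```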
As for alignment - this matters if You want to bind parts of a single memory object to multiple resources (buffers or images). In Your example, You can create 3 buffers which can each hold 280 bytes of data. But (as indicated by the `vkGetBufferMemoryRequirements()` function) each such buffer requires 512 bytes of memory aligned to 512 bytes. So for the purpose of 3 separate buffers, You would need 3 separate memory objects, each of size 512 bytes, or a single memory object of size 1536 bytes. Then a memory range starting at offset 0 could be bound to the first buffer, from offset 512 to the second buffer, and from offset 1024 to the third buffer. But even though You bind 512 bytes of memory to Your buffer, don't forget that Your buffer can still hold only 280 bytes of data.

In this example the size and alignment are the same (both are 512). Imagine a situation where Your buffer of size 380 bytes requires 386 bytes of memory aligned to 512. Such a situation doesn't change anything - Your first buffer is bound at offset 0 (this offset always meets all alignment requirements), the second at offset 512 and the third buffer at offset 1024. In general, alignment means that the start of the memory range bound to a resource must be a multiple of the given alignment value (counting from the beginning of the memory object).
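If You went the three-separate-buffers route instead, the binding offsets could be derived from the reported requirements like this (a sketch; it assumes all three buffers report identical requirements and reuses the hypothetical `findHostVisibleMemoryTypeIndex()` helper):

```
// Sketch: sub-allocating a single memory object for 3 identical buffers.
VkMemoryRequirements memReq = {};
vkGetBufferMemoryRequirements(device, buffer0, &memReq);  // e.g. size = 512 (or 386), alignment = 512

// Round the per-buffer footprint up to the alignment (the spec guarantees alignment is a power of two),
// so that every binding offset is a multiple of the alignment.
VkDeviceSize stride = (memReq.size + memReq.alignment - 1) & ~(memReq.alignment - 1);

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = 2 * stride + memReq.size;      // the last buffer needs no padding after it
allocInfo.memoryTypeIndex = findHostVisibleMemoryTypeIndex(memReq.memoryTypeBits); // hypothetical helper

VkDeviceMemory memory = VK_NULL_HANDLE;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);

vkBindBufferMemory(device, buffer0, memory, 0 * stride);  // offset 0
vkBindBufferMemory(device, buffer1, memory, 1 * stride);  // offset 512
vkBindBufferMemory(device, buffer2, memory, 2 * stride);  // offset 1024
```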
In Your case, one big buffer is probably better (in terms of wasted memory space): `3 * 280` equals `840`, and the relative difference between the required memory size and the size of Your buffer will probably be smaller.

- Thank you. Your first assumption was correct - I want to create one buffer that can hold `3 * 280` bytes of data. You told me that I need to create a buffer that can hold `3 * 280` bytes of data. If I understand correctly this will not work with non-coherent memory. Let's say `nonCoherentAtomSize` is 300, so I can't map the range `[0, 280]`, but can `[0, 300]`, which overlaps the second part of the data (since all data resides contiguously), which is not good. That was non-coherent memory. But what about coherent? Is it safe to map the `[0, 280]` range in this case? – nikitablack Jun 14 '18 at 21:30
- @nikitablack You can map any range You want (probably You want to map the whole range, the whole buffer, and keep the pointer to that memory - the mapping operation can be expensive). The `nonCoherentAtomSize` value does not influence mapping, but synchronization. From the spec: *"`nonCoherentAtomSize` is the size and alignment in bytes that bounds concurrent access to host-mapped device memory."* When You synchronize GPU operations with reading or writing to host-visible memory, You must broaden the range that needs to be synchronized to the nearest multiples of this value. – Ekzuzy Jun 14 '18 at 22:12
- Spec: "The application must guarantee that any previously submitted command that writes to this range has completed before the host reads from or writes to that range, and that any previously submitted command that reads from that range has completed before the host writes to that region. If the device memory was allocated without the HOST_COHERENT flag, **these guarantees must be made for an extended range**: the application must round down the start of the range to the nearest multiple of nonCoherentAtomSize, and round the end of the range up to the nearest multiple of nonCoherentAtomSize." – Ekzuzy Jun 14 '18 at 22:14
- So in Your case, You can still create a buffer that is `280 * 3` large. You just can't synchronize parts of this memory with greater granularity than the `nonCoherentAtomSize` value. For example, as far as I understand it, You cannot at the same time (concurrently) submit commands that read data in the first 10 bytes of memory and also read the next 10 bytes that were modified by earlier commands. You must synchronize larger parts of memory according to the above rules. – Ekzuzy Jun 14 '18 at 22:18
- The problem is not in the mapping, the problem is in writing to mapped memory. Obviously, I can't write to the parts of memory that are in use. And I need guarantees that if I write to the first part of contiguous memory, the second part (which is potentially in use) stays untouched. With non-coherent memory it's clear - I can't map any range, I need to respect `nonCoherentAtomSize`. But with coherent memory it's not that clear. – nikitablack Jun 15 '18 at 06:20
- @nikitablack You CAN map any range. You just must make sure that You don't modify it while it is being used by a GPU (or that the GPU has finished modifying it when You want to read or modify it). I'd assume that with the COHERENT flag, the granularity is 1 byte. But if this is a problem for You, maybe You should keep two copies of Your buffer and switch them between uses. In fact, it is quite common to do so - the device uses one buffer while You modify another, and the next time the device uses the other buffer while You access the first one. – Ekzuzy Jun 15 '18 at 06:55
- Thank you, for me it's not a problem, I just want to understand how to do it correctly. And again, buffers here are not important - memory is important. I can still have multiple buffers bound to the wrong parts of memory, and switching between them in this case will not help. But I think I agree with you about the coherent memory granularity of 1 byte, so `280 * 3` should be fine. – nikitablack Jun 15 '18 at 07:20
When you bind the buffer to memory, the `memoryOffset` needs to be a multiple of the alignment value returned in `VkMemoryRequirements`. So you should have three `VkBuffer`s of 280 bytes each, but you'll bind them as:

```
// stride = 512 in your example: 512 rounded up to a multiple of 512.
// would still be true if memoryRequirements.size was just 280.
// if 512 < memoryRequirements.size <= 1024, stride would be 1024, etc.
VkDeviceSize stride = round_up(memoryRequirements.size, memoryRequirements.alignment);
vkBindBufferMemory(device, buffer0, memory, 0 * stride);
vkBindBufferMemory(device, buffer1, memory, 1 * stride);
vkBindBufferMemory(device, buffer2, memory, 2 * stride);
```

So the size of the `VkDeviceMemory` needs to be at least `2 * stride + memoryRequirements.size` (or simply `3 * stride`), which is 1536 bytes in your example.
The `nonCoherentAtomSize` is independent of all of that. It's essentially the cache line or memory transaction size. For non-coherent memory, if you write one byte in a "non-coherent atom", the CPU will still have to write out the whole atom to memory, which means you'll clobber any simultaneous writes to that atom from the GPU. With coherent memory, the CPU and GPU cooperate so that they can each write adjacent bytes without overwriting each other's data. But if you're using non-coherent memory and want to write to one of your `VkBuffer`s when the GPU might be writing to another `VkBuffer` that's in the same `VkDeviceMemory`, you probably want to make sure the two `VkBuffer`s don't overlap within the same `nonCoherentAtomSize` chunk of the memory.
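If you do end up using non-coherent memory, the widened flush the spec describes might look roughly like this (a sketch; `physicalDevice`, `device`, and `memory` are assumed to already exist):

```
// Sketch: flushing a 280-byte host write from non-coherent memory.
// The flushed range is widened to multiples of nonCoherentAtomSize, as the spec requires.
VkPhysicalDeviceProperties props = {};
vkGetPhysicalDeviceProperties(physicalDevice, &props);
VkDeviceSize atom = props.limits.nonCoherentAtomSize;

VkDeviceSize writeOffset = 0;   // where the write starts within the memory object
VkDeviceSize writeSize   = 280;

VkMappedMemoryRange range = {};
range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
range.memory = memory;
range.offset = (writeOffset / atom) * atom;                                          // round down
range.size   = ((writeOffset + writeSize + atom - 1) / atom) * atom - range.offset;  // round up
// Near the end of the allocation, clamp the size or pass VK_WHOLE_SIZE instead.
vkFlushMappedMemoryRanges(device, 1, &range);
```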

- Thank you, but you didn't understand me, or more probably I was not clear - I need to create one buffer which should serve as a container for 3 pieces of data. Please see the comment to Ekzuzy's answer. You are telling me that it's safe to map the `[0, 280]` range for coherent memory. But my understanding is that at some point the data written to mapped memory needs to be flushed somehow to GPU/main memory, right? But flushing happens by cache line size (i.e. 64 bytes in most cases), meaning `[0, 280]` couldn't be updated since it's not a multiple of 64 (or whatever the cache line size is). – nikitablack Jun 14 '18 at 21:30
- @nikitablack: Then flush a multiple of the line size. What's your problem? – Nicol Bolas Jun 14 '18 at 22:21
- @NicolBolas This can lead to undefined behavior, can't it? Example - I have a continuous memory with 2 ranges - `[0, 280)` and `[280, 560)`. Let's say the second one is in use, so no one should write to it. The first is safe to update. So I map the memory range `[0, 280)` and memcpy the data. Since the memory is coherent, the flushing happens automatically on submit. But since the range is not a multiple of 64, the implementation will flush 320 bytes (a multiple of 64) instead of 280, so it will write to the part of memory that is in use (the second range). Am I right in my assumptions? – nikitablack Jun 15 '18 at 06:10
- @nikitablack: "*I have a continuous memory with 2 ranges - [0, 280) and [280, 560).*" Why would you do that if you know you have to flush on a certain alignment? My point is that you should be properly aligning your ranges to the various alignments required by the specification. – Nicol Bolas Jun 15 '18 at 13:11
- @NicolBolas But the specification doesn't require any alignment for _coherent_ memory. The problem I tried to explain is similar to _False Sharing_ - if two objects are in the same cache line then updating one of them automatically invalidates the second. I don't know how coherent memory works on the CPU and GPU, but I bet the CPU still flushes a multiple of the cache line size. Anyway, I don't know how to explain the problem clearly, it's confusing :( – nikitablack Jun 15 '18 at 13:44
- @nikitablack: If the specification doesn't require it, then *there is no requirement* and there's nothing to worry about. You're really overthinking the problem. [Coherent memory is implemented in a way so that such concerns are irrelevant](https://stackoverflow.com/a/36241756/734069). – Nicol Bolas Jun 15 '18 at 13:46
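For completeness, the coherent-memory path these comments converge on looks roughly like this (a sketch; `memory` is the `VkDeviceMemory` backing the `280 * 3` buffer, and the fence/semaphore synchronization that keeps the GPU away from the slot being written is assumed to happen elsewhere):

```
// Sketch: persistently map the whole allocation once and update one 280-byte slot at a time.
// With a HOST_COHERENT memory type no vkFlushMappedMemoryRanges call is needed;
// the only requirement is that the GPU is not currently using the slot being written.
void* mapped = nullptr;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

const VkDeviceSize slotSize = 280;
unsigned char data[280] = {};  // hypothetical per-slot source data
for (uint32_t slot = 0; slot < 3; ++slot) {
    memcpy(static_cast<char*>(mapped) + slot * slotSize, data, slotSize);  // needs <cstring>
}

// The memory can stay mapped for the lifetime of the allocation; unmap only when done with it.
vkUnmapMemory(device, memory);
```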