
I was wondering what sort of operations the CPU can handle/perform while a memory operation is being carried out by a device's DMA controller, to increase the level of concurrency. And if the CPU's caches/registers are empty, how can another instruction be fetched without interleaving with the DMA in progress?

Thanks


1 Answer


In general, on big¹ hardware, the CPU can do more or less anything while a DMA transfer is in progress. Typically, it simply continues with normal execution of running processes or kernel tasks under the control of the OS.

Regarding your question:

... if the CPU's caches/registers are empty, how can another instruction be fetched without interleaving with the DMA in progress?

As I understand it, you are asking what happens if the CPU needs to access memory. The CPU usually accesses memory frequently in any case, not only when "registers or caches are empty". This activity can proceed more or less normally² while a DMA transfer is in progress. The memory bus is generally shared by several agents already, including multiple DMA-capable devices, PCI cards, and multiple cores or CPUs. The memory controller is responsible for accepting and fulfilling all of these requests, which includes arbitrating between them.
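To make the arbitration point concrete, here is a toy model in C. It is purely illustrative and is not how any real memory controller is implemented (real controllers use far more sophisticated scheduling than simple round-robin): a "controller" drains a CPU request queue and a DMA request queue alternately, so neither agent has to wait for the other's entire transfer to finish.

```c
/* Toy model of a memory controller arbitrating between two request
 * streams. Round-robin is an assumption for illustration only; real
 * controllers schedule requests in far more sophisticated ways. */
#include <stdio.h>

typedef struct { const char *agent; unsigned addr; } request_t;

int main(void) {
    /* Pending requests from two agents sharing the memory controller. */
    request_t cpu[] = { {"CPU", 0x100}, {"CPU", 0x104}, {"CPU", 0x108} };
    request_t dma[] = { {"DMA", 0x800}, {"DMA", 0x804},
                        {"DMA", 0x808}, {"DMA", 0x80c} };
    size_t nc = sizeof cpu / sizeof cpu[0];
    size_t nd = sizeof dma / sizeof dma[0];
    size_t ic = 0, id = 0;

    /* One request serviced per "cycle", alternating between queues:
     * the CPU's accesses are interleaved with the DMA's rather than
     * stalled until the whole DMA transfer completes. */
    while (ic < nc || id < nd) {
        if (ic < nc) { request_t r = cpu[ic++]; printf("serviced %s @ 0x%03x\n", r.agent, r.addr); }
        if (id < nd) { request_t r = dma[id++]; printf("serviced %s @ 0x%03x\n", r.agent, r.addr); }
    }
    return 0;
}
```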

So you are correct that there may be some type of "interleaving" when both DMA and the CPU access memory, just as there may be when two cores (or even two logical threads running on the same core) access memory. How it works out in practice depends on how the DRAM is organized, how the memory controller works (and how many there are), and many other details, but in general you can expect modern memory systems to be highly parallel - capable of sustaining multiple streams of accesses and often approaching the bandwidth limits imposed by the RAM itself.
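You can get a feel for this contention from user space. The sketch below is only a rough stand-in: a second thread doing bulk `memcpy` plays the role of DMA traffic (a real DMA engine is a separate device, not a CPU thread). The timed streaming read slows down while the copier runs, but it still completes - it is never blocked waiting for the "transfer" to finish, which is the effect footnote 2 below describes.

```c
/* Rough illustration of memory-bus contention. A memcpy thread stands
 * in for DMA traffic; this is an analogy, not a real DMA measurement.
 * Build: cc -O2 -pthread contention.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_BYTES (256u * 1024 * 1024)  /* large enough to defeat the caches */

static volatile int stop;  /* plain flag is adequate for a sketch */

static void *bulk_copier(void *arg) {
    (void)arg;
    char *src = malloc(BUF_BYTES), *dst = malloc(BUF_BYTES);
    memset(src, 1, BUF_BYTES);
    while (!stop)
        memcpy(dst, src, BUF_BYTES);  /* keep the memory bus busy */
    free(src);
    free(dst);
    return NULL;
}

static double timed_stream_sum(const long *buf, size_t n) {
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sum == 42)  /* keep the compiler from discarding the loop */
        puts("unlikely");
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    size_t n = BUF_BYTES / sizeof(long);
    long *buf = malloc(BUF_BYTES);
    if (!buf)
        return 1;
    for (size_t i = 0; i < n; i++)
        buf[i] = (long)i;

    /* Baseline: the streaming read has the memory system to itself. */
    printf("alone:     %.3f s\n", timed_stream_sum(buf, n));

    /* Contended: the copier competes for bandwidth; the read still
     * completes, just more slowly. */
    pthread_t t;
    pthread_create(&t, NULL, bulk_copier, NULL);
    printf("contended: %.3f s\n", timed_stream_sum(buf, n));

    stop = 1;
    pthread_join(t, NULL);
    free(buf);
    return 0;
}
```

On typical desktop hardware you would expect the contended time to be noticeably larger than the baseline, yet both runs finish: the two access streams share the bus rather than serialize behind one another.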


¹These days that pretty much means anything bigger than an embedded microcontroller - e.g., even mobile CPUs qualify.

²By "normally" I mean that the normal mechanisms are used and you can expect memory access to work, but not that performance won't be affected. The memory access by the CPU will be competing with the DMA access (and perhaps with other accesses by other CPUs, PCI devices such as video cards, etc.) and will likely be slower - but on reasonable hardware it certainly won't have to wait until the DMA finishes!

BeeOnRope
  • Whilst DMA and CPU memory access are interleaved/managed, it can cause noticeable effects. A well-written algorithm can completely saturate the memory bus bandwidth and become "I/O bound". If a DMA transfer happens, your algorithm then slows down, because the DMA transfer has to squeeze through the same bus! I've had this happen - felt quite proud! – bazza Jan 28 '17 at 21:53
  • Thanks for the detailed info. I wasn't really aware of that "sharing" thing. The memory controller must really be doing some crazy stuff. And @bazza, it must also be hard to get to the point where you can test that difference. Congrats! – stdout Jan 29 '17 at 00:08
  • @bazza - definitely! I didn't mean to imply otherwise. Basically the path to DRAM, and the DRAM itself, is a shared resource, and DMA use of that resource isn't free. That said, DMA doesn't usually lock the bus or anything, so access to memory is still possible. *Most* algorithms aren't super sensitive to memory bandwidth, but rather to latency, and a good controller will generally satisfy concurrent requests in a reasonable way. – BeeOnRope Jan 29 '17 at 01:11
  • On weaker hardware, everything may be different. Perhaps a DMA transfer monopolizes the bus, or the memory controller can't handle concurrent requests, and so on. – BeeOnRope Jan 29 '17 at 01:13
  • @BeeOnRope, oh, I hadn't read it that way; I was merely seeking to add to your excellent answer :) Indeed, most algorithms don't get that far, and this is something Intel have been very successful in judging - acceptable performance across a wide range of general-purpose computing situations. And indeed a smaller CPU with a tighter transistor budget may well be designed as you describe. – bazza Jan 29 '17 at 08:41
  • @zgulser I have implemented some DSP systems that have saturated the memory bus. FFTs lend themselves to being written so that the L1 and L2 (& L3, if present) caches succeed in keeping the CPU core itself fully busy. This is what Intel's MKL library strives to do. As soon as the data array size exceeds the cache size, data for the FFT has to start coming from DRAM. And if you're doing that enough (e.g. a very large FFT, or several at once on separate cores), the DRAM bus itself starts struggling. And then a network DMA starts... Intel's VTune reveals a lot. – bazza Jan 29 '17 at 08:49
  • FWIW I added a footnote to my claim that the memory access proceeds "normally" to clarify that it means that functionally the CPU-to-memory access proceeds, but perhaps at reduced performance as @bazza pointed out. – BeeOnRope Jan 29 '17 at 15:24
  • @BeeOnRope One last thing about the interleaved access that might happen between, for example, the CPU and DMA (or I/O devices). You mentioned that may happen when DMA monopolizes the bus. First of all, is that related to becoming a bus master, or something else? And I know it depends on many things, but do you have any idea what's going to happen when both try to access the same memory location? – stdout Feb 08 '17 at 09:45
  • Well, about "monopolizing the bus": I'm not an EE and I don't know all the details - it was simply an observation that smaller/cheaper/lower-power devices don't necessarily have the complexity needed to, say, arbitrate among requests in a fine-grained manner, or may have an electrical design which prevents it. That is basically an observation and doesn't necessarily apply to any particular device. About your other question: the comments really aren't the best place to ask questions. Just ask a new question on SO. @zgulser – BeeOnRope Feb 08 '17 at 21:51
  • @Bee Alright, thanks. I already asked it: http://stackoverflow.com/questions/42403764/cpus-in-multi-core-architectures-and-memory-access/42412139#42412139. But one thing I'm still wondering about is - by saying "shared", did you mean two devices can access the memory at the same time on the same bus? Or did you rather mean "arbitrating" between devices, meaning one device can rule at a time? – stdout Feb 23 '17 at 15:45
  • You got a reasonable answer there, and in general I think your question is a bit too black-and-white for what is a complex dance across many pieces of hardware, but in any case I gave my best shot [at a detailed answer](http://stackoverflow.com/a/42459847/149138). @zgulser – BeeOnRope Feb 25 '17 at 18:18