
As far as I know, in a modern multi-core CPU system, different CPUs share one memory bus. Does that mean only one CPU can access memory at any given moment, since there is only one memory bus, which cannot be used by more than one CPU at a time?

choxsword

2 Answers


Yes, at the simplest level, a single memory bus will only be doing one thing at once. For memory buses, it's normal for them to be half-duplex (i.e. either loading or storing, not sending data in both directions at once like gigabit Ethernet or PCIe).

Requests can be pipelined to minimize the gaps between requests, but transferring a cache-line of data takes multiple back-to-back cycles.
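
For example, with a typical 64-bit DDR4 channel, a 64-byte cache line arrives as a burst of eight 8-byte transfers, i.e. four memory-clock cycles at double data rate, so even perfectly pipelined requests each occupy the data bus for several cycles.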


First of all, remember that when a CPU core "accesses memory", it doesn't have to read directly from DRAM. The caches maintain a coherent view of memory shared by all cores, using (a variant of) the MESI cache-coherency protocol.

Essential reading for the low-level details about how cache + memory works: Ulrich Drepper's 2007 article What Every Programmer Should Know About Memory?, and my 2017 update on what's changed and what hasn't. e.g. a single core can barely saturate the memory controllers on a low-latency dual/quad core Intel CPU, and not even close on a many-core Xeon where max_concurrency / latency is the bottleneck, not the DRAM controller bandwidth. (Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?).


All high-performance / multi-core systems use caches, and normally every core has its own private L1i/L1d cache. In most modern multi-core CPUs, there are 2 levels of private cache per core, with a large shared cache. Earlier CPUs (like Intel Core2) only had private L1 caches, and the large shared last-level cache was L2.

Multi-level caches are essential to give low latency / high bandwidth for the most-hot data while still being large enough to have a high hit rate over a large working set.
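
To see those levels concretely, here is a minimal sketch (my own illustration, not from the answer; the sizes and step count are arbitrary) that times dependent pointer-chasing loads over growing working sets. The jumps in nanoseconds per load roughly mark where the working set spills out of L1, then L2, then L3, and finally into DRAM:

```cpp
// Pointer-chase latency vs. working-set size (illustrative sketch).
// Build a single random cycle through an array, then chase it: every load
// depends on the previous one, so the measured time is pure load-to-use latency.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    const size_t steps = 1 << 24;                        // loads per measurement (arbitrary)
    for (size_t bytes = 16 << 10; bytes <= (64u << 20); bytes *= 4) {
        size_t n = bytes / sizeof(size_t);
        std::vector<size_t> order(n), next(n);
        std::iota(order.begin(), order.end(), 0);
        std::shuffle(order.begin(), order.end(), rng);
        for (size_t i = 0; i < n; ++i)                   // link the elements into one random cycle
            next[order[i]] = order[(i + 1) % n];

        volatile size_t sink = order[0];
        size_t cur = sink;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < steps; ++i)
            cur = next[cur];                             // dependent load chain, no parallelism
        auto t1 = std::chrono::steady_clock::now();
        sink = cur;                                      // keep the chain from being optimized away

        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("%8zu KiB: %6.2f ns/load\n", bytes >> 10, ns);
    }
}
```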

Intel divides up their L3 caches into slices on the ring bus that connects cores together. So multiple accesses to different slices of L3 can happen simultaneously. See David Kanter's write-up of Sandybridge. Only on an L3 miss does the request need to be sent to a memory controller. (The memory controllers themselves have some buffering / reordering capability.)


Data written by one core can be read by another core without ever being written back to DRAM. A shared last-level cache acts as a backstop for shared data. (Intel CPUs with inclusive L3 cache also use it as a snoop filter to avoid broadcasting cache-coherency traffic to all cores: Which cache mapping technique is used in intel core i7 processor?).

But the writer will have the cache line in Modified state (and all other cores have it Invalid), so the reader has to request it from the writer to get it in Shared state. This is somewhat slow. See What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?, and What will be used for data exchange between threads are executing on one Core with HT?.
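
As a rough illustration of that cost (a sketch of mine, not from the answer; it assumes the two threads land on different physical cores and does no affinity pinning), two threads can ping-pong a value through a single atomic variable, so every hand-off has to pull the line out of the writer's Modified state:

```cpp
// Producer/consumer ping-pong through one cache line (illustrative sketch).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

std::atomic<uint64_t> token{0};    // the cache line that bounces between the two cores

int main() {
    const uint64_t rounds = 1'000'000;

    std::thread pong([&] {
        for (uint64_t i = 1; i <= 2 * rounds; i += 2) {
            while (token.load(std::memory_order_acquire) != i) {}   // spin until the ping arrives
            token.store(i + 1, std::memory_order_release);          // reply
        }
    });

    auto t0 = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < 2 * rounds; i += 2) {
        token.store(i + 1, std::memory_order_release);               // ping
        while (token.load(std::memory_order_acquire) != i + 2) {}    // wait for the pong
    }
    auto t1 = std::chrono::steady_clock::now();
    pong.join();

    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / rounds;
    std::printf("~%.0f ns per round trip (two cache-line transfers)\n", ns);
}
```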

On modern Xeon multi-socket systems, I think it's still the case that dirty data can be sent between sockets without writing back to DRAM. But I'm not sure.

AMD Ryzen has separate L3 for each quad-core cluster, so data transfer between core-clusters is slower than within a single core cluster. (And if all the cores are working on the same data, it will end up replicated in the L3 of each cluster.)


Typical Intel/AMD desktop/laptop systems have dual-channel memory controllers, so (if both memory channels are populated) there can be two burst transfers in flight simultaneously, one to each DIMM.

But if only one channel is populated, or they're mismatched and the BIOS doesn't run them in dual-channel mode, or there are no outstanding accesses to cache lines that map to one of the channels, then memory parallelism is limited to pipelining access to one channel.
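
For a rough sense of scale, assuming DDR4-3200: each channel peaks at about 25.6 GB/s (3200 MT/s × 8 bytes), so a populated dual-channel setup tops out around 51 GB/s in theory, while a single channel gets half that.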


I know that modern CPUs use caches to achieve low latency. So my question is based on the scenario where the computer has just started: there is no data in the cache yet, so CPUs will fetch data directly from memory.

Nobody would design a multi-core system with no caches at all. That would be terribly inefficient because the cores would block each other from accessing the bus to fetch instructions as well as data, as you suspect.

One fast CPU can do everything that two half-speed CPUs can do, and some things it can't (like run a single thread fast).

If you can build a CPU complex enough to support SMP operation, you can (and should) first make it support some cache. Maybe just internal tags for external data (for faster hit/miss checking), if we're talking about really old CPUs where the transistor budget for the whole chip was too low for much/any internal cache.

Or you could always have fully external cache outside the CPU, as part of an SMP interconnect. But the CPU has to know about it, at least to be able to mark some memory regions uncacheable so MMIO works, and (if it's not write-through) for consistent DMA. If you want private caches for each core, it can't just be a transparent memory-side cache (i.e. caching just the DRAM, not even seeing accesses to physical memory addresses that aren't backed by DRAM).

Multiple cores on a single piece of silicon only makes sense once you've pushed single-core performance to the point of diminishing returns with pipelining, caches, and superscalar execution. Maybe even out-of-order execution, although there are some multi-core in-order x86 and ARM chips. If running carefully-tuned code, out-of-order execution isn't always necessary for some kinds of problems. For example, GPUs don't use OoO exec because they're just designed for massive throughput with simple control.

Pipelining and caching can give huge speed improvements. See http://www.lighterra.com/papers/modernmicroprocessors/

Peter Cordes
  • Sorry for the scatter-brained partial answers to so many questions. It's a big topic and I wasn't sure exactly what the question was *really* asking. – Peter Cordes Jul 29 '18 at 16:44
  • I know that modern CPUs use caches to achieve low latency. So my question is based on the scenario where the computer has just started: there is no data in the cache, so CPUs will fetch data directly from memory. – choxsword Jul 30 '18 at 01:33

Summary: it's generally possible for a single core to saturate the memory bus if memory access is all it does.

If you establish the memory bandwidth of your machine, you should be able to see if a single-threaded process can really achieve this and, if not, how the effective bandwidth use scales with the number of processors.
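
One way to do that, sketched below with some assumptions of my own (a 1 GiB buffer, a plain summing loop, power-of-two thread counts), is to have each thread stream through its own slice of a large array and compare the achieved GB/s as the thread count grows:

```cpp
// Read-bandwidth scaling vs. thread count (illustrative sketch).
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const size_t n = (size_t(1) << 30) / sizeof(uint64_t);      // 1 GiB working set (arbitrary)
    std::vector<uint64_t> buf(n, 1);
    unsigned max_threads = std::max(1u, std::thread::hardware_concurrency());

    for (unsigned nthreads = 1; nthreads <= max_threads; nthreads *= 2) {
        std::vector<std::thread> pool;
        std::vector<uint64_t> sums(nthreads, 0);
        const size_t chunk = n / nthreads;

        auto t0 = std::chrono::steady_clock::now();
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                uint64_t s = 0;
                for (size_t i = t * chunk; i < (t + 1) * chunk; ++i)
                    s += buf[i];                                 // read-only streaming pass
                sums[t] = s;                                     // published so the loop isn't dead code
            });
        for (auto& th : pool) th.join();
        auto t1 = std::chrono::steady_clock::now();

        double secs = std::chrono::duration<double>(t1 - t0).count();
        double gb   = double(nthreads) * chunk * sizeof(uint64_t) / 1e9;
        std::printf("%2u thread(s): %5.1f GB/s (checksum %llu)\n",
                    nthreads, gb / secs, (unsigned long long)sums[0]);
    }
}
```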

Now I'll explain further.

It all depends on the architecture you're using; for now, let's assume modern SMP and SDRAM:

1) If two cores try to access the same address in RAM

It could go several ways:

  • they both want to read, simultaneously:
    • two cores on the same chip will probably share an intermediate cache at some level (L2 or L3), so the read from DRAM will only be done once. On a modern architecture, each core may be able to keep executing µ-ops from one or more pipelines until the cache line is ready
    • two cores on different chips may not share a cache, but still need to co-ordinate access to the bus: ideally, whichever chip didn't issue the read will simply snoop the response
  • if they both want to write:
    • two cores on the same chip will just be writing to the same cache, and that only needs to be flushed to RAM once. In fact, since RAM is read and written a whole cache line at a time, writes to distinct but sufficiently close addresses can be coalesced into a single write to RAM
    • two cores on different chips do have a conflict: the cache line will need to be written back to RAM by chip 1, fetched into chip 2's cache, modified, and then written back again (no idea whether the write/fetch can be coalesced by snooping). A rough sketch of this cache-line ping-pong follows after the list.
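
Here is that sketch (my illustration, assuming 64-byte cache lines and arbitrary iteration counts): two threads increment separate counters, once when the counters share a cache line and once when each counter is padded onto its own line. Even two cores on one chip, each with a private L1d, have to bounce the Modified line between them in the shared case:

```cpp
// False sharing vs. padded counters (illustrative sketch).
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <thread>

// Two threads hammer two counters; return the elapsed wall-clock seconds.
static double hammer(std::atomic<uint64_t>& a, std::atomic<uint64_t>& b) {
    auto work = [](std::atomic<uint64_t>& c) {
        for (int i = 0; i < 20'000'000; ++i)
            c.fetch_add(1, std::memory_order_relaxed);
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(a)), t2(work, std::ref(b));
    t1.join(); t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

struct SameLine { std::atomic<uint64_t> a{0}, b{0}; };          // adjacent: almost certainly one 64-byte line
struct OwnLines { alignas(64) std::atomic<uint64_t> a{0};
                  alignas(64) std::atomic<uint64_t> b{0}; };    // each counter forced onto its own line

int main() {
    SameLine s;
    OwnLines o;
    std::printf("counters sharing a line:    %.2f s\n", hammer(s.a, s.b));
    std::printf("counters on separate lines: %.2f s\n", hammer(o.a, o.b));
}
```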

2) If two cores try to access different addresses

For a single access each, the CAS latency means the two operations can potentially be interleaved so that they take no longer (or perhaps only a little longer) than if the bus were idle.

Barr J
  • What if we add a memory bank to our computer? Will the behavior above be any different? Under that circumstance, could different CPUs access different memories at one moment? – choxsword Jul 29 '18 at 05:56
  • no, the behavior above will stay pretty much the same. – Barr J Jul 29 '18 at 05:58
  • So multiple memories are still regarded as one peripheral from the CPUs' point of view? – choxsword Jul 29 '18 at 05:59
  • I wonder whether it's reasonable to call memory a "peripheral". – choxsword Jul 29 '18 at 07:17
  • Well, you can use it as a term if you want; it describes the situation at hand. Whether it's the right term, well... one can argue about that. – Barr J Jul 29 '18 at 07:20
  • Two cores on the same chip normally each have their own private L1d cache, so they still need to get the line in MESI Modified state before they can commit their store data to that cache line. On a single-socket Intel CPU, one core typically has to write-back as far as the shared L3 cache (but not to DRAM). [Which cache mapping technique is used in intel core i7 processor?](https://stackoverflow.com/q/49092541). AMD CPUs use MOESI which allows cache-to-cache transfers of dirty data. As far as externally observable DRAM activity, you'd see one write of the line when eventually evicted. – Peter Cordes Jul 29 '18 at 16:01
  • It's not "generally" true that a single core can saturate DRAM anymore. Intel dual / quad core chips can come close, but many-core Xeons are limited by max_concurrency / latency; they can't keep enough stores in flight to saturate DRAM. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020), and also [What's missing/sub-optimal in this memcpy implementation?](https://stackoverflow.com/a/26256216) and also [this](https://stackoverflow.com/q/25179738) mentions multiple threads giving a speedup. – Peter Cordes Jul 29 '18 at 16:04