Suppose multiple threads are running in parallel on several cores of a CPU. Can they access main memory at the same time?
-
Maybe. It depends on the addresses, the access width, and the memory architecture, and typically it is not something you rely on without some assistance from hardware locking. – Martin James Mar 18 '16 at 16:57
1 Answer
Main memory and shared last-level-cache read bandwidth is a shared resource that multiple cores compete for, but yes, multiple readers reading the same byte of memory will typically complete faster than multiple readers reading from separate pages. (Not true with writes in the mix.)
If the shared-memory region is small enough that it's hot in each core's private cache, then each core can read from it at very high speeds. Writing will slow down other readers and esp. other writers (cf. Are cache-line-ping-pong and false sharing the same?).
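For example, here is a minimal sketch of the ping-pong / false-sharing effect (the names and sizes are purely illustrative): two counters that land in the same 64-byte line make two writer threads fight over that line, while `alignas(64)` padding removes the interaction.

```cpp
#include <atomic>
#include <thread>

// Illustrative only: "Unpadded" puts both counters in the same 64-byte cache
// line, so two writer threads keep stealing the line from each other
// (cache-line ping-pong / false sharing). "Padded" gives each counter its own
// line, so the threads barely interact.
struct Unpadded {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <class Counters>
void hammer(Counters& c) {
    auto bump = [](std::atomic<long>& x) {
        for (long i = 0; i < 50'000'000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(bump, std::ref(c.a)), t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
}

int main() {
    Unpadded u;
    Padded p;
    hammer(u);  // slow: both atomics live in one cache line
    hammer(p);  // much faster: no sharing, false or otherwise
}
```

On real hardware the unpadded version is typically several times slower; that difference is the ping-pong cost.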
Readers don't slow each other down much if they don't use any kind of locking, instead relying on lockless algorithms to avoid errors due to race conditions. This is why lockless programming is sometimes worth the (large) challenge of getting it correct, compared to just using locking or producer-consumer flags.
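A minimal sketch of the producer-consumer-flag idea (the names are made up): the writer publishes its data with a release store, readers spin on an acquire load, and no reader ever takes a lock or blocks another reader.

```cpp
#include <atomic>

// Sketch only: the writer publishes with a release store and readers spin on
// an acquire load, so no thread ever takes a lock.
struct Slot {
    int payload = 0;
    std::atomic<bool> ready{false};
};

void produce(Slot& s, int value) {
    s.payload = value;                                // write the data
    s.ready.store(true, std::memory_order_release);   // then publish it
}

int consume(const Slot& s) {
    while (!s.ready.load(std::memory_order_acquire)) {
        // spin; a real reader might pause or yield here
    }
    return s.payload;  // safe: the acquire load pairs with the release store
}
```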

-
While I do agree with the answer, I am very much doubtful OP would understand a single word in it. Just saying. – SergeyA Mar 18 '16 at 19:55
-
@SergeyA: Me too :P It's a question with a moderately interesting answer, so I answered it. It's not my problem if people ask questions that they won't understand the answer to. They can google those phrases to get started learning more. There are more good links at the [x86 tag wiki](http://stackoverflow.com/tags/x86/info) that I didn't include. I think it's better than just answering "yes, and the full answer is really complicated", because this question was specific enough to have a short-ish answer. (It didn't include anything about synchronization or memory ordering.) – Peter Cordes Mar 18 '16 at 20:07
-
A while back I did a small benchmark that compared a single reader vs. multiple readers (6, to be precise) on a single named shared segment, with one writer writing periodically; each reader was bound to its own dedicated core, and all were on the same CPU. The latency actually went up on E5 and Skylake. Intuitively, I had the same thought as you, that the average latency should go down since last-level caches are shared in most CPUs. Any idea why I see something different? – HCSF Mar 04 '19 at 15:18
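For reference, the per-core pinning described in that setup might look something like the sketch below (Linux-specific; the core number is illustrative, not from the benchmark). Binding each reader to its own CPU keeps it from migrating mid-measurement.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the calling thread to one CPU so the scheduler can't move it.
static void pin_current_thread_to(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::thread reader([] {
        pin_current_thread_to(2);  // e.g. dedicate core 2 to this reader
        // ... reader loop over the shared segment goes here ...
    });
    reader.join();
}
```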
-
@HCSF: If multiple threads are competing for read access to the same L3 cache lines, they probably slow each other down. Were you pointer chasing over a large working set to defeat L1d and L2 caches to measure L3 latency? – Peter Cordes Mar 04 '19 at 21:48
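("Pointer chasing" here means something like the following sketch; the 8 MiB working set is illustrative. Because every load's address comes from the previous load, the loads can't overlap, so the time per step is roughly the load latency of whichever cache level holds the working set.)

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const size_t N = 8 * 1024 * 1024 / sizeof(size_t);  // ~8 MiB: defeats L1d/L2 on most CPUs
    std::vector<size_t> next(N);
    std::iota(next.begin(), next.end(), size_t{0});

    // Sattolo's algorithm: permute into a single N-element cycle so the chase
    // really walks the whole buffer instead of a short sub-cycle.
    std::mt19937_64 rng{42};
    for (size_t i = N - 1; i > 0; --i) {
        size_t j = std::uniform_int_distribution<size_t>(0, i - 1)(rng);
        std::swap(next[i], next[j]);
    }

    const long steps = 20'000'000;
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long s = 0; s < steps; ++s)
        idx = next[idx];                    // serialized, dependent loads
    auto t1 = std::chrono::steady_clock::now();

    std::printf("%zu (keep idx live), %.1f ns per load\n", idx,
                std::chrono::duration<double, std::nano>(t1 - t0).count() / steps);
}
```

Shrinking N until the array fits in L1d or L2 lets the same loop measure those levels instead.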
-
@PeterCordes I am not familiar with hardware-level interaction. Are you saying that multiple threads reading the same memory address (whose content is likely in L3) will be slower because only 1 thread (or a few threads?) has read access to the same location (or cache line? or the entire L3 as a whole?)? The memory segment was designed as a 64MB ringbuffer, each thread followed the same reading sequence, and each thread `memcpy()`-ed out the message, so I don't think there is much pointer chasing unless `memcpy()` somehow serializes. – HCSF Mar 05 '19 at 02:46
-
@HCSF: You said earlier you were measuring *latency*. A memcpy test would be measuring *bandwidth*. That's more obviously able to cause contention. In any given clock cycle, a single slice of L3 cache can only respond to one request. (The ring bus on CPUs before Skylake-X is 32 bytes wide, so it takes 2 cycles per message.) I assume most of this is fairly well pipelined, but it's certainly easy to imagine contention. 64MB is bigger than the L3 in any CPU, so it's easy to imagine multiple readers causing more misses if they don't run in lock-step with each other. HW prefetch is mostly L2 – Peter Cordes Mar 05 '19 at 03:15
-
@PeterCordes `memcpy()` alone might be measuring bandwidth. But my test setup compares the time from the writer writing the last byte to the reader being "able" to `memcpy()` the entire message (400 bytes), so the comparison between a single reader and multiple readers is more like observing the latency impact of multiple readers. Are you saying the contention comes from multiple cores copying data from L3 into L2 and L1, because access to L3 goes through a single shared ring bus connected to all the cores? – HCSF Mar 05 '19 at 03:33
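(A measurement of that shape might look like the sketch below; the names, layout, and single-slot simplification are assumptions, not the actual benchmark. The writer stamps the message just before bumping the sequence number; the reader `memcpy()`s the payload and subtracts the stamp to get a per-message writer-to-reader latency.)

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstring>

struct Message {
    int64_t publish_ns;     // writer-side steady_clock timestamp
    char    payload[400];
};

struct Slot {
    std::atomic<uint64_t> seq{0};
    Message msg;
};

static int64_t now_ns() {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(
               std::chrono::steady_clock::now().time_since_epoch()).count();
}

void write_msg(Slot& s, const char (&data)[400]) {
    std::memcpy(s.msg.payload, data, sizeof data);
    s.msg.publish_ns = now_ns();                     // stamp just before publishing
    s.seq.fetch_add(1, std::memory_order_release);   // publish
}

// Returns the observed latency in ns. A real ring buffer would re-check seq
// after the copy (seqlock style) to reject messages overwritten mid-read.
int64_t read_msg(Slot& s, uint64_t last_seen, char (&out)[400]) {
    while (s.seq.load(std::memory_order_acquire) == last_seen) { /* spin */ }
    std::memcpy(out, s.msg.payload, sizeof out);
    return now_ns() - s.msg.publish_ns;
}
```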
-
The misses, as you pointed out, will be very bad if the readers and writer don't run in lock-step. But my writer writes so infrequently that the test setup makes them run virtually in lockstep. – HCSF Mar 05 '19 at 03:33
-
@HCSF: Ok, so you're measuring the inter-core latency. I was assuming the readers were all just running free because you didn't say what kind of microbenchmark it was, or that there was any synchronization between writer and readers. Anyway, my understanding is that the ring bus can't multicast to multiple readers, so multiple requests for the same cache line at the exact same time will have to be staggered somewhat. (But with different cores being different distances away on the ring bus, they probably woke up from their read spin-loop at different times anyway.) – Peter Cordes Mar 05 '19 at 03:39
-
@HCSF: you're not going to get a satisfactory answer in comments. Write up a real question with a MCVE including your code + experimental results and hardware setup if you want anyone to take a stab at answering it. – Peter Cordes Mar 05 '19 at 03:40