
I am working in a bare-metal environment and thus evaluating performance at a low level. How should I expect two threads on the same core to perform when writing to different sections of the same cache line?

I am somewhat new to multicore/multithread architectures. I understand that when different cores write to the same cache line, locks or atomic operations are required to avoid race conditions. At the same time, sharing a cache line between cores also sets one up for performance issues such as false sharing.

However, do I need to worry about similar things when the two threads are on the same core? I'm unsure, since they share the same cache and there are multiple load-store units. For example, say thread1 writes to section1 of a cache line at the same time that thread2 wants to write to section2 of the same line. Does each thread just modify its own section of the cache line, or do they read the full line, modify their section, and write the full line back into the cache? If it's the latter, do I need to worry about race conditions or performance delays?
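
To make the scenario concrete, here is a rough sketch of the kind of access pattern I mean (plain C with pthreads rather than my actual bare-metal code; the 128-byte line size and the struct layout are only illustrative):

```c
/* Two threads, each writing only to its own field, but both fields sit in
 * the same cache line. The 128-byte line size is an assumption. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 128  /* assumed cache-line size */

struct shared_line {
    volatile uint64_t section1;                  /* written only by thread1 */
    volatile uint64_t section2;                  /* written only by thread2 */
    char pad[LINE_SIZE - 2 * sizeof(uint64_t)];  /* keep both in one line */
} __attribute__((aligned(LINE_SIZE)));

static struct shared_line line;

static void *writer1(void *arg)
{
    (void)arg;
    for (uint64_t i = 0; i < 1000000; i++)
        line.section1 = i;  /* never touches section2, only shares the line */
    return NULL;
}

static void *writer2(void *arg)
{
    (void)arg;
    for (uint64_t i = 0; i < 1000000; i++)
        line.section2 = i;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_create(&t2, NULL, writer2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%llu %llu\n", (unsigned long long)line.section1,
                          (unsigned long long)line.section2);
    return 0;
}
```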

  • Hint: Write a ping-pong test that bounces data back-and-forth between two threads. Then pin the two threads onto the same core (via hyperthreading) and observe the timings. – Mysticial Apr 19 '17 at 20:17
  • Thanks for the suggestion. I was trying to understand the expectations first, but just running a test is probably a solid option. That'll help me on the performance-delay front... I guess I could do a test for race conditions too: write an incrementing number into a shared cache line and see if the data ever doesn't match what was written. Although I'm unsure whether that would prove that it isn't possible or just that it didn't happen to occur. – Tyler Apr 19 '17 at 20:33
  • Related: https://stackoverflow.com/questions/32979067/what-will-be-used-for-data-exchange-between-threads-are-executing-on-one-core-wi (actually a duplicate if this question is about Intel with HT. You say you have 128B cache lines, so maybe not. What [SMT](https://en.wikipedia.org/wiki/Simultaneous_multithreading) microarchitecture are you using?) – Peter Cordes Aug 29 '17 at 00:20
  • See also: https://stackoverflow.com/questions/45602699/what-are-the-latency-and-throughput-costs-of-producer-consumer-sharing-of-a-memo for a test like what @Mysticial suggested. Reading/writing the same lines can lead to lots of memory-order mis-speculation pipeline clears on Intel hardware. The store buffer is partitioned between the two hyperthreads, so false-sharing of a cache line is still a bad problem. – Peter Cordes Aug 29 '17 at 00:21
  • @PeterCordes: Thank you so much for your input above and below. Those links were very helpful. Since posting this question I've learned more about the effects of store/load queues, and from what I read in [https://stackoverflow.com/questions/45602699/...](https://stackoverflow.com/questions/45602699/what-are-the-latency-and-throughput-costs-of-producer-consumer-sharing-of-a-memo) and your comments it is obvious to me now that my question here was actually trying to ask about store and load queues, and how they might be impacted by "hyperthreads" sharing cache lines. – Tyler Oct 24 '17 at 20:07
  • @PeterCordes: I am working with an IBM POWER8 chip, so Power ISA. I am specifically looking at SMT2 mode in this case, so two threads. The CPU manual explains that the store queue is dynamically shared among threads and that loads which hit stores in the queue are candidates for store forwarding. That said, I know that in SMT2 mode each thread has its own LSU. It is not obvious to me, however, whether each LSU has its own store queue and, if so, whether store forwarding is allowed between LSUs. – Tyler Oct 24 '17 at 20:59
  • Almost certainly not for stores that are still speculative. Possible for stores that have retired (and thus are ready to commit to L1D but haven't yet). Power's memory ordering is weak enough that it would be ok to let the SMT sibling(s) see your stores before they become visible to threads on other cores, I think (by committing to L1D). (Unlike x86, where all cores have to agree on a total store order so side-channel forwarding wouldn't be allowed unless it was done speculatively with roll-back if another core wanted the line before the forwarded store commits to L1D....) – Peter Cordes Oct 24 '17 at 21:06
  • Anyway, it would maybe be a valid design choice, but would require more transistors to check the other thread's store queue(s) for retired stores, as well as your own store queue for *all* stores. If you want to know, you'll have to experiment on the microarchitecture you care about if you can't find anything definitive. (Or edit this question and maybe get an answer if anyone else knows.) – Peter Cordes Oct 24 '17 at 21:07
  • @PeterCordes: Thanks, I think I've gotten a good enough understanding of the mechanisms involved. I can proceed with my own tests and will better understand the results. Much appreciated. – Tyler Oct 26 '17 at 16:37
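
For reference, a minimal version of the ping-pong test Mysticial suggests above might look like the following (Linux with pthreads rather than bare metal; logical CPUs 0 and 1 are assumed to be SMT siblings of one physical core, which has to be checked against the machine's actual topology):

```c
/* Minimal ping-pong latency test, with both threads pinned onto the same
 * physical core (two SMT sibling logical CPUs). Which logical CPU numbers
 * are siblings is machine-specific; 0 and 1 here are an assumption.
 * Build: gcc -O2 -pthread pingpong.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
static atomic_int ball = 0;   /* 0: ping's turn, 1: pong's turn */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *ping(void *arg)
{
    pin_to_cpu(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load(&ball) != 0) ;  /* wait for our turn */
        atomic_store(&ball, 1);            /* pass the ball back */
    }
    return NULL;
}

static void *pong(void *arg)
{
    pin_to_cpu(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load(&ball) != 1) ;
        atomic_store(&ball, 0);
    }
    return NULL;
}

int main(void)
{
    int cpu_a = 0, cpu_b = 1;   /* assumed SMT siblings; check your topology */
    struct timespec t0, t1;
    pthread_t a, b;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, ping, &cpu_a);
    pthread_create(&b, NULL, pong, &cpu_b);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per round trip\n", ns / ITERS);
    return 0;
}
```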

1 Answer


You are over-complicating this.

There are different layers of caches. The details depend very specifically on the CPU you are using (not just generically x86 or ARM, but which architecture version/generation), but typically you have an L1 cache intimately connected to each individual core, and then L2 is where the cores come together on the way to the shared memory/address space.

All a cache does, at whatever layer, is sit on the main memory (space) bus and watch transactions go by. If a transaction is tagged as cacheable, the cache examines its tags to see if there is a hit or a miss and acts accordingly. The cache does not know, cannot know, and does not care who or what caused that transaction: whether it came from an instruction, which instruction, which task/program/thread that instruction belonged to, whether it is a prefetch, or whether it came from a DMA engine. It doesn't care; it sees a transaction like any other and follows the rules: pass it on through if it is not cacheable; if it is cacheable, look for a hit and deal with the hit or miss.
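
To illustrate, the per-transaction behaviour described above can be modeled roughly like this (a simplified, direct-mapped toy in C, for illustration only; real caches are set-associative and add write-back, allocation, and eviction policy on top):

```c
/* Toy model of the lookup a cache performs on every bus transaction.
 * Direct-mapped and greatly simplified for illustration. */
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES  512
#define LINE_BYTES 128          /* assumed line size */

struct cache_line {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_BYTES];
};

static struct cache_line cache[NUM_LINES];

/* The cache never sees which core, thread, or instruction generated the
 * access; it only sees an address and a cacheable/not-cacheable attribute. */
bool lookup(uint64_t addr, bool cacheable)
{
    if (!cacheable)
        return false;                 /* pass straight through to memory */

    uint64_t index = (addr / LINE_BYTES) % NUM_LINES;
    uint64_t tag   = addr / (LINE_BYTES * NUM_LINES);

    if (cache[index].valid && cache[index].tag == tag)
        return true;                  /* hit: service from this line */

    /* miss: (evict if needed and) fill the line from the next level */
    cache[index].valid = true;
    cache[index].tag   = tag;
    return false;
}
```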

So, from that: if you have more than one core/CPU hitting a shared cache, and for some reason they happen to be accessing memory so close together that it lands in the same cache line, then the cache will react accordingly.

If instead you have the same CPU running two threads, the whole "at the same time" thing doesn't apply; of course it doesn't truly apply in the shared-CPU case either. You could have the two accesses one clock apart, but it is a shared bus, generally not dual- or multi-ported at this level. Despite that, the cache will act per its design: ignore and pass on the transaction if it is marked not-cacheable, or search for a hit if it is cacheable and act accordingly.

  • What you will hopefully learn is that benchmarking is b---s---: you are not going to get accurate results. Your results not only will vary from run to run, especially on an operating system with multiple cores hitting shared resources, but cache lines get evicted all the time, and from a programmer's perspective it appears random. Interrupts and other tasks, moving the mouse more or less, will affect what code runs when and what is in the cache and what isn't, which can affect performance from run to run as well as any attempt to predict performance. – old_timer Apr 19 '17 at 17:38
  • Not that you cannot evaluate performance with caches and branch predictors and other things; it is just not generally a case where you can assume two threads hitting a cache line which you expect to be in the cache when that happens, design for it, and then predict the performance that results. It is easy to make the performance bad or good at times with what appear to be subtle things: add a nop in the bootstrap, re-arrange the objects on the compile command line, etc. It's not that benchmarking and performance evaluation lose their value completely, but ... – old_timer Apr 19 '17 at 17:43
  • Thanks for your reply. I design for bare-metal hard real-time and operate entirely out of cache aside from passing a little data between cores, which will pass through main memory. For me, deterministic run-to-run timing is critical, but you're right, I could probably dismiss jitter on this order. I'm more concerned about losing data. I suppose my confusion is that I pictured the threads as almost having an additional cache, the size of one cache line, that they would operate on and write back. If they operate directly on L1 then there should be no issue, right? – Tyler Apr 19 '17 at 18:59
  • Depends on the processor: is the L1 in each core, or is L1 shared? If each core has its own and you have one program/thread per core, then they own that cache. Then when it gets to the shared cache, so long as other activity doesn't bump that cache line out to main memory, it doesn't matter who touches it; it is there in the cache. – old_timer Apr 19 '17 at 19:10
  • Generally everything in L1 is in L2 for a single core. For a shared L2, the individual L1s could compete for L2 space and bump each other, but it's the same story: if you get an L1 miss and an L2 hit, it doesn't matter who it is. Now you have a coherency problem to deal with if threads that need to share data don't share the same L1, and how to resolve that is most likely chip/architecture dependent. – old_timer Apr 19 '17 at 19:14
  • At the end of the day the cache is only slightly smarter than memory, which is really dumb; the cache just does tag lookups and determines whether it has a copy or needs to get a copy (or needs to save its copy). It is no smarter than that; the programmer and the CPU and its features determine the rest. So you have to figure out how big a cache line even is, if that matters to you; if everything is in cache, then just operate within that many kbytes of address space and you are golden (other than coherency). – old_timer Apr 19 '17 at 19:16
  • Each core has its own L1, L2, and L3. I don't need to worry about threads on separate cores contaminating each other's memory regions, that's taken care of, but I'm currently making decisions on laying out the memory regions for multiple threads sharing a core. Currently the data being used by thread1 and thread2 of the same core could be interleaved in memory, not separated; I'm trying to decide if that's problematic. The cache line size is 128 bytes, so I can definitely have more than one piece of data in a line. – Tyler Apr 19 '17 at 19:28
  • This answer is bogus. The L1D cache itself is simple, yes; What makes it complicated (and a performance issue) is that each logical thread on an SMT CPU has a separate store queue. https://stackoverflow.com/questions/45602699/what-are-the-latency-and-throughput-costs-of-producer-consumer-sharing-of-a-memo. Stores go into the store queue at execution, and only commit to L1D after retirement. Loads probe the store queue as well as L1D. On x86, ordering rules require a total store order, so letting the HT sibling see the store early isn't allowed. Nor seeing it before retirement on any ISA. – Peter Cordes Aug 29 '17 at 00:32
  • This answer seems to get caching mostly wrong and doesn't even begin to address the actual question about threads on the _same_ core, which has mostly nothing to do with the caches at all (since sibling threads share all levels of cache). – BeeOnRope Aug 29 '17 at 00:57
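
Given the 128-byte line size and the shared store queue discussed in the comments, the usual way to take line sharing between the two hardware threads off the table is to pad and align each thread's private data to a full cache line. A sketch (the field names and the two-slot array are illustrative assumptions, not from the discussion):

```c
/* Sketch: give each hardware thread's private data its own cache line so
 * neither the coherence machinery nor the shared store queue ever sees the
 * two threads touching the same line. The 128-byte size matches the line
 * size mentioned in the comments; the fields are illustrative. */
#include <stdint.h>

#define LINE_SIZE 128

struct per_thread {
    uint64_t counter;
    uint64_t scratch[4];
    /* pad the struct out to a full line */
    char pad[LINE_SIZE - 5 * sizeof(uint64_t)];
} __attribute__((aligned(LINE_SIZE)));

/* one slot per SMT thread on the core, indexed by thread id */
static struct per_thread thread_data[2];

_Static_assert(sizeof(struct per_thread) == LINE_SIZE,
               "per-thread data must occupy exactly one cache line");
```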