I have an application where I need to broadcast a single (non-constant, plain-old-data) value in global memory to all threads. The threads only need to read the value, not write to it. I cannot explicitly tell the application to use the constant cache (e.g., via cudaMemcpyToSymbol) because I am using a memory-wrapping library that does not give me explicit low-level control.
I am wondering how this broadcast takes place under the hood, and how it may differ from the usual access pattern where each thread accesses a unique global memory location (for simplicity assume that this "usual" access pattern is coalesced). I am especially interested in any implicit serializations that may take place in the broadcast case, and how this may be affected by different architectures.
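For concreteness, here is a minimal sketch of the two access patterns I am comparing (the kernel and variable names are my own, purely for illustration):

```cuda
// Broadcast pattern: every thread reads the same global memory location.
__global__ void broadcastRead(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[0];  // all threads load the identical address
}

// "Usual" pattern: each thread reads a unique location, fully coalesced.
__global__ void coalescedRead(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];  // consecutive threads load consecutive addresses
}
```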
For example, for Fermi, presumably the first thread to access the value will pull it to the L2 cache, then to its SM's L1 cache, at which point every thread resident on the SM will attempt to grab it from the L1 cache. Is there any serialization penalty when all threads attempt to access the same L1 cache value?
For Kepler, presumably the first thread to access the value will pull it to the L2 cache (then may or may not pull it to the L1 cache depending on whether L1 caching is enabled). Is there any serialization penalty when all threads attempt to access the same value in L2?
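In case it matters for the answer, my understanding is that on Kepler the global-load caching policy can be selected at compile time, and that compute capability 3.5 adds a read-only data cache reachable with __ldg(). A sketch of what I mean (kernel name is a placeholder; the flags in the comments are nvcc's):

```cuda
// Global-load caching policy can be selected when compiling, e.g.:
//   nvcc -Xptxas -dlcm=ca kernel.cu   (cache in L1 and L2, where the part supports it)
//   nvcc -Xptxas -dlcm=cg kernel.cu   (cache in L2 only)
// On compute capability 3.5+, a read-only load can instead go through the
// read-only data cache via __ldg():
__global__ void broadcastReadLdg(const float * __restrict__ src,
                                 float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = __ldg(src);  // read-only cached load of the broadcast value
}
```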
Also, is partition camping a concern?
I found a couple of other questions that address a similar topic, but not at a level of detail sufficient to satisfy my curiosity.
Thanks in advance!