Why are std::hardware_con/destructive_interference_size different?

Question

I have found only one question here about std::hardware_destructive_interference_size and std::hardware_constructive_interference_size, and it doesn't answer the following: why are there two distinct values? Both should be the same as the cacheline-size. So what cache architecture could require two distinct values?

"*Both should be the same as the cacheline-size.*" Should they? Can you explain why they should be the same? In a way that *doesn't* use specific implementation details? — Nicol Bolas, Dec 18 '19 at 16:03
The maximum size of a data structure that fits within a single cacheline should be the same as the minimum distance between two data structures needed to prevent false sharing. — Bonita Montero, Dec 18 '19 at 16:08
That's the article I mentioned. But it doesn't tell why these sizes are different. — Bonita Montero, Dec 18 '19 at 17:12
As I've understood it, the sizes _don't need_ to be different. The separate definitions just cover the case that they _could_ be different, i.e. to cover exotic H/W as well. (I must admit I don't try too hard to think about H/W. I'm too busy getting my S/W running and hopefully free of U.B. and, maybe, even with performance) ;-) — Scheff's Cat, Dec 18 '19 at 17:19
It depends on the maximum achievable alignment. If it is less than the L1 cache line size then the compiler can't ensure that a variable is stored at the start of a cache line. — Hans Passant, Dec 18 '19 at 17:34
_"Both should be the same as the cacheline-size."_ Does not need to be. Some architectures apply prefetching where two consecutive cache lines are involved. See those comments by Peter Cordes: https://stackoverflow.com/questions/39680206/understanding-stdhardware-destructive-interference-size-and-stdhardware-cons/39887282#comment127425357_39887282. — Daniel Langr, May 22 '23 at 08:56

score 1 · Answer 1 · answered Dec 20 '19 at 18:43

At least two types of cache designs can have different minimum alignment for avoiding false sharing and maximum alignment for true sharing: sectored cache blocks and variably aligned cache blocks.
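For context, here is a minimal sketch of how the two constants are typically consumed: the destructive value as a minimum spacing between independently written objects, the constructive value as an upper bound on data meant to be fetched together. The names `kDestructive`, `kConstructive`, `PerThreadCounters` and `HotPair` are made up for illustration, and the 64-byte fallback is an assumption for libraries that do not define the constants.

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_*_interference_size, where the library provides them

// Use the library constants when available; otherwise fall back to 64 bytes.
// The fallback value is an assumption, not something the standard guarantees.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
inline constexpr std::size_t kConstructive = std::hardware_constructive_interference_size;
#else
inline constexpr std::size_t kDestructive = 64;
inline constexpr std::size_t kConstructive = 64;
#endif

// Written by different threads: space the members at least kDestructive
// bytes apart so a write to one should not disturb the other.
struct PerThreadCounters {
    alignas(kDestructive) std::atomic<unsigned long> produced{0};
    alignas(kDestructive) std::atomic<unsigned long> consumed{0};
};

// Read together: keep the pair within kConstructive bytes so a single
// fetch is expected to bring in both members.
struct alignas(kConstructive) HotPair {
    int key;
    int value;
};

static_assert(offsetof(PerThreadCounters, consumed) -
                  offsetof(PerThreadCounters, produced) >= kDestructive,
              "counters must be at least kDestructive bytes apart");
static_assert(sizeof(HotPair) <= kConstructive,
              "HotPair must fit in the constructive size");
```

The two `static_assert`s express the two separate promises. On hardware where both constants equal the cache line size they collapse into the same number, which is the case the question assumes.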

A sectored cache that fetches the entire block (IBM-speak; "sector" in Intel-speak; the unit of tag coverage) on a miss would have the block (sector) size for std::hardware_constructive_interference_size. Since the smaller sectors (IBM-speak; "line" in Intel-speak; the unit of validity) would be invalidated by remote (or different-level cache) writes, std::hardware_destructive_interference_size would be the size of this smaller chunk. This is a design that has been implemented.

(It is not clear if a system that typically prefetches the adjacent block would have std::hardware_constructive_interference_size as twice the cache block/line size while having the cache block/line size for std::hardware_destructive_interference_size.)

Variably aligned cache blocks* (a design targeting larger cache blocks with slightly less capacity wasted to internal fragmentation) align storage at a smaller value than the cache block size. E.g., a 64B cache block could be aligned at an even or odd 32B alignment; std::hardware_constructive_interference_size would be 32B (since an odd-32B aligned cache block would not fetch the complementary half of a 64B aligned chunk) but std::hardware_destructive_interference_size would be 128B (since an odd-32B aligned cache block would interfere with two 64B-aligned addresses). Variably aligned cache blocks also break the assumption that alignment alone is sufficient for managing this aspect of cache performance.
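The arithmetic in that example can be checked with a small model. This is only a sketch of one reading of the design: a block holds `kBlockSize` bytes and may start at any multiple of `kBlockAlign`. Both constants and all function names are hypothetical, and real hardware would have more constraints.

```cpp
#include <cstddef>

// Hypothetical parameters taken from the example above.
inline constexpr std::size_t kBlockSize = 64;   // bytes held by one cache block
inline constexpr std::size_t kBlockAlign = 32;  // a block may start at any multiple of this

// True if a block starting at `start` holds the byte at `addr`.
constexpr bool covers(std::size_t start, std::size_t addr) {
    return start <= addr && addr < start + kBlockSize;
}

// True if SOME permitted block placement holds both bytes
// (the condition under which false sharing is possible).
constexpr bool may_share_block(std::size_t a, std::size_t b) {
    const std::size_t lo = a < b ? a : b;
    const std::size_t hi = a < b ? b : a;
    // The latest permitted start at or below lo reaches furthest toward hi.
    return covers(lo - lo % kBlockAlign, hi);
}

// True if EVERY permitted block that holds `a` also holds `b`
// (the condition under which fetching `a` is sure to bring in `b`).
constexpr bool always_fetched_together(std::size_t a, std::size_t b) {
    std::size_t start = a - a % kBlockAlign;
    while (covers(start, a)) {
        if (!covers(start, b)) return false;
        if (start < kBlockAlign) break;  // no earlier placement exists
        start -= kBlockAlign;
    }
    return true;
}

// Constructive size: an aligned 32B chunk always travels together,
// but an aligned 64B chunk does not.
static_assert(always_fetched_together(64, 95) && always_fetched_together(95, 64));
static_assert(!always_fetched_together(64, 96));

// Destructive size: bytes in neighbouring 64B-aligned slots can still
// share a block, while objects of up to 64B on 128B boundaries cannot.
static_assert(may_share_block(127, 128));
static_assert(!may_share_block(63, 128));
```

Under these assumptions the model reproduces the 32B and 128B figures. It also shows the limit of the alignment approach: `may_share_block(127, 128)` stays true whatever the slot size, so the 128B spacing only helps when each object leaves part of its slot unused.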

Another possibility that would break these definitions would be a strided cache (a limited form of data trace cache). A cache that supported blocks with 2-word stride (i.e., one block storing words 0, 2, 4, etc. but not words 1, 3, 5, etc.) would significantly mess with the assumption behind std::hardware_constructive_interference_size and std::hardware_destructive_interference_size. While such cache blocks would typically be allocated for strided vector caching, the design violates the expectation of orthogonality and could cause performance problems when non-strided accesses are introduced later.

* The proposal for variable alignment mapped an alignment to a way and used overlaid skewed associativity to avoid capacity waste when any alignment was more common than another.

That's wrong. With sectored caches the unit of coherence between the cores is a sector. The maximum size for true sharing should be a sector and the minimum distance between objects to avoid false sharing should be a sector also. So both should be equally sized as well. — Bonita Montero, Dec 21 '19 at 19:30
@BonitaMontero True sharing is the fetch width, which is the cache block (IBM)/cache sector (Intel) when the entire block/sector is fetched (which is typically the case for such caches, at least from main memory — reads satisfied from another cache might leave other sectors/lines invalid). Coherence (invalidation) is at the granularity of sector/line, so false sharing is at that granularity. std::hardware_constructive_interference_size is more concerned with how much memory will be brought in on an access, particularly from main memory. — , Dec 21 '19 at 19:34
No, true sharing means that there aren't any avoidable collisions with other caching-units on other cores. It is not necessarily bound to the fetch-width. With sectored caches different cores can share different sectors of the same cacheline. So this isn't related to the cacheline as a whole. — Bonita Montero, Dec 21 '19 at 21:00
