I understand how temporal and spatial locality affect design decisions when coding, and I also understand when alignment affects cache performance. However, could somebody please show a C++ example where cache associativity is taken into account to make a piece of code faster?
Let's say the target is an x86 Intel CPU where the L1 cache is 8-way set associative, the L2 is 8-way set associative, and the L3 is 16-way set associative.
(My overall aim with this question is to understand how set associativity affects performance, and how to "program to the hardware" to gain performance when the target architecture is known.)
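To make the question concrete, here is a minimal sketch of how I currently picture addresses mapping to sets. The geometry is an assumption on my part (a 32 KiB L1 data cache with 64-byte lines, on top of the 8-way associativity mentioned above), as are the hypothetical names `CacheGeometry`, `num_sets`, `set_index`, and `critical_stride`. It also assumes simple modulo indexing, which may not hold for every cache level.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical cache description; the sizes used below are assumptions,
// not values taken from a datasheet.
struct CacheGeometry {
    std::size_t size_bytes;  // total capacity in bytes
    std::size_t line_bytes;  // cache line size in bytes
    std::size_t ways;        // associativity (lines per set)
};

// Number of sets = capacity / (line size * ways).
constexpr std::size_t num_sets(CacheGeometry g) {
    return g.size_bytes / (g.line_bytes * g.ways);
}

// Set an address maps to, assuming plain modulo indexing on the line number.
constexpr std::size_t set_index(std::uintptr_t addr, CacheGeometry g) {
    return (addr / g.line_bytes) % num_sets(g);
}

// Stride in bytes at which successive addresses fall into the same set.
constexpr std::size_t critical_stride(CacheGeometry g) {
    return g.line_bytes * num_sets(g);
}

// Assumed L1d: 32 KiB, 64-byte lines, 8-way set associative.
constexpr CacheGeometry kL1{32 * 1024, 64, 8};

static_assert(num_sets(kL1) == 64, "32 KiB / (64 B * 8 ways) = 64 sets");
static_assert(critical_stride(kL1) == 4096, "64 B * 64 sets = 4096 B");
static_assert(set_index(0, kL1) == set_index(4096, kL1),
              "addresses 4096 B apart share a set");
static_assert(set_index(0, kL1) != set_index(64, kL1),
              "adjacent lines use different sets");
```

If this picture is right, then under these assumptions touching more than 8 addresses spaced 4096 bytes apart (for example, walking down a column of a row-major matrix whose rows are 4096 bytes wide) would compete for a single 8-way set. Is that the kind of situation where associativity should drive the code's layout?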