I have read that when accessing with a stride
for (int i = 0; i < aSize; i++) a[i] *= 3;
for (int i = 0; i < aSize; i += 16) a[i] *= 3;
both loops should perform similarly, as memory accesses are in a higher order than multiplication.
I'm playing around with google benchmark and while testing similar cache behaviour, I'm getting results that I don't understand.
template <class IntegerType>
void BM_FillArray(benchmark::State& state) {
for (auto _ : state)
{
IntegerType a[15360 * 1024 * 2]; // Reserve array that doesn't fit in L3
for (size_t i = 0; i < sizeof(a) / sizeof(IntegerType); ++i)
benchmark::DoNotOptimize(a[i] = 0); // I have compiler optimizations disabled anyway
}
}
BENCHMARK_TEMPLATE(BM_FillArray, int32_t);
BENCHMARK_TEMPLATE(BM_FillArray, int8_t);
Run on (12 X 3592 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 15360 KiB (x1)
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_FillArray<int32_t> 196577075 ns 156250000 ns 4
BM_FillArray<int8_t> 205476725 ns 160156250 ns 4
I would expect accessing the array of bytes to be faster than the array of ints as more elements fit in a cache line, but this is not the case.
Here are the results with optimizations enabled:
BM_FillArray<int32_t> 47279657 ns 47991071 ns 14
BM_FillArray<int8_t> 49374830 ns 50000000 ns 10
Anyone please can clarify this? Thanks :)
UPDATE 1:
I have read the old article "What programmers should know about memory" and everything is more clear now. However, I've tried the following benchmark:
template <int32_t CacheLineSize>
void BM_ReadArraySeqCacheLine(benchmark::State& state) {
struct CacheLine
{
int8_t a[CacheLineSize];
};
vector<CacheLine> cl;
int32_t workingSetSize = state.range(0);
int32_t arraySize = workingSetSize / sizeof(CacheLine);
cl.resize(arraySize);
const int32_t iterations = 1536 * 1024;
for (auto _ : state)
{
srand(time(NULL));
int8_t res = 0;
int32_t i = 0;
while (i++ < iterations)
{
//size_t idx = i% arraySize;
int idx = (rand() / float(RAND_MAX)) * arraySize;
benchmark::DoNotOptimize(res += cl[idx].a[0]);
}
}
}
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 1)
->Arg(32 * 1024) // L1 Data 32 KiB(x6)
->Arg(256 * 1024) // L2 Unified 256 KiB(x6)
->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 64)
->Arg(32 * 1024) // L1 Data 32 KiB(x6)
->Arg(256 * 1024) // L2 Unified 256 KiB(x6)
->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArraySeqCacheLine, 128)
->Arg(32 * 1024) // L1 Data 32 KiB(x6)
->Arg(256 * 1024) // L2 Unified 256 KiB(x6)
->Arg(15360 * 1024);// L3 Unified 15360 KiB(x1)
I would expect random accesses to perform much worse when working size doesn't fit the caches. However, these are the results:
BM_ReadArraySeqCacheLine<1>/32768 39936129 ns 38690476 ns 21
BM_ReadArraySeqCacheLine<1>/262144 40822781 ns 39062500 ns 16
BM_ReadArraySeqCacheLine<1>/15728640 58144300 ns 57812500 ns 10
BM_ReadArraySeqCacheLine<64>/32768 32786576 ns 33088235 ns 17
BM_ReadArraySeqCacheLine<64>/262144 32066729 ns 31994048 ns 21
BM_ReadArraySeqCacheLine<64>/15728640 50734420 ns 50000000 ns 10
BM_ReadArraySeqCacheLine<128>/32768 29122832 ns 28782895 ns 19
BM_ReadArraySeqCacheLine<128>/262144 31991964 ns 31875000 ns 25
BM_ReadArraySeqCacheLine<128>/15728640 68437327 ns 68181818 ns 11
what am I missing?
UPDATE 2:
I'm using now what you suggested (linear_congruential_engine) to generate the random numbers, and I'm using only static arrays, but the results are now even more confusing to me.
Here is the updated code:
template <int32_t WorkingSetSize, int32_t ElementSize>
void BM_ReadArrayRndCacheLine(benchmark::State& state) {
struct Element
{
int8_t data[ElementSize];
};
constexpr int32_t ArraySize = WorkingSetSize / sizeof(ElementSize);
Element a[ArraySize];
constexpr int32_t iterations = 1536 * 1024;
linear_congruential_engine<size_t, ArraySize/10, ArraySize/10, ArraySize> lcg; // I've tried with many params...
for (auto _ : state)
{
int8_t res = 0;
int32_t i = 0;
while (i++ < iterations)
{
size_t idx = lcg();
benchmark::DoNotOptimize(res += a[idx].data[0]);
}
}
}
// L1 Data 32 KiB(x6)
// L2 Unified 256 KiB(x6)
// L3 Unified 15360 KiB(x1)
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 32 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 256 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024, 128);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 1);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 64);
BENCHMARK_TEMPLATE(BM_ReadArrayRndCacheLine, 15360 * 1024 * 4, 128);
Here are the results (optimizations enabled):
// First template parameter is working set size.
// Second template parameter is array elemeent size.
BM_ReadArrayRndCacheLine<32 * 1024, 1> 2833786 ns 2823795 ns 249
BM_ReadArrayRndCacheLine<32 * 1024, 64> 2960200 ns 2979343 ns 236
BM_ReadArrayRndCacheLine<32 * 1024, 128> 2896079 ns 2910539 ns 204
BM_ReadArrayRndCacheLine<256 * 1024, 1> 3114670 ns 3111758 ns 236
BM_ReadArrayRndCacheLine<256 * 1024, 64> 3629689 ns 3643135 ns 193
BM_ReadArrayRndCacheLine<256 * 1024, 128> 3213500 ns 3187189 ns 201
BM_ReadArrayRndCacheLine<15360 * 1024, 1> 5782703 ns 5729167 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024, 64> 5958600 ns 6009615 ns 130
BM_ReadArrayRndCacheLine<15360 * 1024, 128> 5958221 ns 5998884 ns 112
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 1> 6143701 ns 6076389 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 64> 5800649 ns 5902778 ns 90
BM_ReadArrayRndCacheLine<15360 * 1024 * 4, 128> 5826414 ns 5729167 ns 90
How is it possible that for (L1d < workingSet < L2) results do not differ much against (workingSet < L1d)? Throughput and latency of L2 are still very high, but with the random accesses I'm trying to prevent prefetching and force cache misses.. so, why I'm not even noticing a minimal increment?
Even when trying to fetch from main memory (workingSet > L3) I'm not getting a massive performance drop. You mention that latest architectures can hold bandwidths of up to ~8bytes per clock, but I understand that they must copy a hold cache line, and that without prefetching with a predictable linear pattern, latency should be more noticeable in my tests... why is not the case?
I suspect that page faults and tlb may have something to do too.
(I've downloaded vtune analyzer to try to understand all this stuff better, but it's hanging on my machine and I've waiting for support)
I REALLY appreciate your help Peter Cordes :)
I'm just a GAME programmer trying to show my teammates whether using certain integer types in our code might (or not) have implications on our game performance. For example, whether we should worry about using fast types (eg. int_fast16_t) or using the least possible bytes in our variables for better packing (eg. int8_t).