The following example code generates a matrix of size N and transposes it SAMPLES times. With N = 512, the average execution time of the transposition operation is 2144 μs (coliru link). At first glance there is nothing special, right?...
Well, here are the results for other sizes:

N = 513 → 1451 μs
N = 519 → 600 μs
N = 530 → 486 μs
N = 540 → 492 μs (finally! theory starts working :)
So why, in practice, do these simple calculations differ so much from theory? Is this behavior related to CPU cache coherence or cache misses? If so, please explain.
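To make the cache-miss hypothesis concrete, here is a small back-of-the-envelope sketch (my addition, not part of the benchmark). It counts how many distinct L1 cache sets a single column walk of A touches, assuming a typical x86 L1d of 32 KiB, 8-way set associative, with 64-byte lines; those cache parameters are assumptions, not measurements from the machine that produced the timings above.

#include <cstdio>
#include <set>

int main()
{
    // Assumed L1d geometry (typical x86 core; NOT measured on the
    // machine that produced the timings above):
    constexpr long line = 64;                    // bytes per cache line
    constexpr long ways = 8;                     // associativity
    constexpr long size = 32 * 1024;             // total L1d capacity in bytes
    constexpr long sets = size / (line * ways);  // 64 sets

    for (long n : {512L, 513L, 519L, 530L, 540L}) {
        // Byte distance between A[j][i] and A[j+1][i] for an int matrix:
        const long stride = n * static_cast<long>(sizeof(int));
        std::set<long> touched; // distinct cache sets hit by one column walk
        for (long j = 0; j < n; j++)
            touched.insert((j * stride / line) % sets);
        std::printf("N = %3ld: one column walk (%ld loads) maps to %zu of %ld sets\n",
                    n, n, touched.size(), sets);
    }
}

Under these assumed parameters, the N = 512 column walk (stride 2048 bytes) cycles through only 2 of the 64 sets, i.e. 2 × 8 = 16 cachable lines for 512 accesses, while N = 513 spreads across all 64 sets; that pattern is at least consistent with the timings above.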
#include <algorithm>
#include <iostream>
#include <chrono>

constexpr int N = 512;       // why is 512 specifically slower? (as of 2016)
constexpr int SAMPLES = 1000;

using us = std::chrono::microseconds;

int A[N][N];

void transpose()
{
    // A[i][j] walks a row (unit stride); A[j][i] walks a column,
    // jumping N * sizeof(int) bytes between consecutive accesses.
    for (int i = 0; i < N; i++)
        for (int j = 0; j < i; j++)
            std::swap(A[i][j], A[j][i]);
}

int main()
{
    // initialize matrix
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = i + j;

    auto t1 = std::chrono::system_clock::now();
    for (int i = 0; i < SAMPLES; i++)
        transpose();
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Average for size " << N << ": "
              << std::chrono::duration_cast<us>(t2 - t1).count() / SAMPLES
              << " (us)\n";
}
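And a quick experiment to test the conflict-miss idea (again my addition, not part of the original benchmark): padding each row with one unused int changes the column stride from 2048 to 2052 bytes. If set conflicts are the cause, this version should land near the N = 513..540 timings even though it still transposes a 512 × 512 matrix.

// Hypothetical variant: pad each row by one int so that column walks
// use a stride of (N + 1) * sizeof(int) bytes instead of N * sizeof(int).
#include <algorithm>
#include <iostream>
#include <chrono>

constexpr int N = 512;
constexpr int SAMPLES = 1000;

using us = std::chrono::microseconds;

int B[N][N + 1]; // one extra (unused) column of padding per row

void transpose_padded()
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < i; j++)
            std::swap(B[i][j], B[j][i]);
}

int main()
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = i + j;

    auto t1 = std::chrono::system_clock::now();
    for (int i = 0; i < SAMPLES; i++)
        transpose_padded();
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Padded average for size " << N << ": "
              << std::chrono::duration_cast<us>(t2 - t1).count() / SAMPLES
              << " (us)\n";
}

The extra column wastes 2 KiB of memory but decouples the logical matrix size from the stride, a common trick for power-of-two-sized buffers.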