12

The following example code generates an N × N matrix and transposes it SAMPLES times. When N = 512, the average execution time of the transposition is 2144 μs (coliru link). At first glance there is nothing special here, right?...

Well, here are the results for

  • N = 513: 1451 μs
  • N = 519: 600 μs
  • N = 530: 486 μs
  • N = 540: 492 μs (finally! theory starts working :).

So why, in practice, are these simple calculations so different from what theory predicts? Does this behavior relate to CPU cache coherence or cache misses? If so, please explain.

#include <algorithm>
#include <iostream>
#include <chrono>

constexpr int N       = 512; // Why is 512 specifically slower (as of 2016)
constexpr int SAMPLES = 1000;
using us = std::chrono::microseconds;

int A[N][N];

void transpose()
{
    for ( int i = 0 ; i < N ; i++ )
    for ( int j = 0 ; j < i ; j++ )
        std::swap(A[i][j], A[j][i]);
}

int main()
{
    // initialize matrix
    for ( int i = 0 ; i < N ; i++ )
    for ( int j = 0 ; j < N ; j++ )
        A[i][j] = i+j;

    auto t1 = std::chrono::system_clock::now();
    for ( int i = 0 ; i < SAMPLES ; i++ )
        transpose();
    auto t2 = std::chrono::system_clock::now();

    std::cout << "Average for size " << N << ": " << std::chrono::duration_cast<us>(t2 - t1).count() / SAMPLES << " (us)"; 
}
Boann
Narek Atayan
  • How many times did you run the snippet? Runtimes can greatly vary from run to run, depending on how many other things your system might be doing. Are these the average times of about 10 or 20 runs, or just the timings from a single run? – JGroven Mar 02 '17 at 20:05
  • 1
    Probably 512 is a magic size that fits the cache horribly, so you get a lot of cache misses. The other sizes fit better, so you get fewer misses. – NathanOliver Mar 02 '17 at 20:05
  • Wrong way round @NathanOliver - 512 is a lot *slower* than 513. – Martin Bonner supports Monica Mar 02 '17 at 20:06
  • Sorry. Just edited the comment to switch it around. – NathanOliver Mar 02 '17 at 20:07
  • disassembly would help :) might be a loop unrolling threshold or something similar – OMGtechy Mar 02 '17 at 20:11
  • I think some compiler optimization things are happening. If you change `A` to volatile, then you get about the same time as everything else. http://coliru.stacked-crooked.com/a/54e5c63f7011edcd – Fantastic Mr Fox Mar 02 '17 at 20:16
  • @FantasticMrFox, Ok but 520 is still much faster: 491 us.. – Narek Atayan Mar 02 '17 at 20:20
  • 1
    I think it's a cache issue. With 32-bit int, and N = 512, A is exactly 8MB in size. – iheanyi Mar 02 '17 at 20:25
  • Actually if you use a block size of 4 ints you go down from `1500 us` to `400 us` for 512 and from `550 us` to `260 us` for 520 ^^ So it looks like caching is the big player here – BeyelerStudios Mar 02 '17 at 20:38
  • 1
    @iheanyi - For A[512][512] and 32 bit (4 byte) integers, I calculate 512*512*4 = 1048576 == 1MB in size (512*512*4 == 1024*1024). – rcgldr Mar 02 '17 at 20:45
  • @NarekAtayan - Probably a cache issue. Cache coherency requires the core send out write update info to the other cores (and/or processors on a multi-processor motherboard), but I don't know the overhead on the core that is doing the writes. I'm wondering if global optimization would convert the loop into a single instance of transpose instead of SAMPLES instances. – rcgldr Mar 02 '17 at 21:02
  • [Related question](/q/42535947) may be relevant. – Toby Speight Mar 02 '17 at 21:04
  • 2
    Cache **associativity**! – Ben Voigt Mar 02 '17 at 21:06
  • @NathanOliver There exist in some CPUs a notion of *false aliasing*, which is to say that because two addresses are identical in all of k least-significant bits, the cache hardware may wrongly assume their two cachelines alias and evict one of them, when in fact they do not. – Iwillnotexist Idonotexist Mar 02 '17 at 21:06
  • @IwillnotexistIdonotexist: *False aliasing* is a different notion and is not the problem here. The effects of early eviction due to *cache associativity* are similar, but please use the correct nomenclature. – Ben Voigt Mar 02 '17 at 21:09
  • BTW Mysticial already explained it: http://stackoverflow.com/a/8547993/103167 – Ben Voigt Mar 02 '17 at 21:09
  • @rcgldr you're correct. I think I must've had a running commentary in my mind to use 32 bits and divide by 8, but skipped the 32 and divide and just put the 8 when I did the math. – iheanyi Mar 02 '17 at 22:31
  • Seems like a duplicate of this - http://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of – Leeor Mar 03 '17 at 21:33
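
The cache-associativity point raised in the comments above can be made concrete. With N = 512 the row stride is 512 × 4 = 2048 bytes, so the elements of a single column (which transpose() walks on one side of every swap) map to only a couple of cache sets and keep evicting each other; for sizes such as 513, 519, 530 or 540 the stride is no longer a multiple of a large power of two and the column spreads across the whole cache. Here is a minimal sketch of that mapping (not from the question), assuming a typical 32 KiB, 8-way L1 data cache with 64-byte lines; these cache parameters are an assumption, not something given in the post:

#include <cstddef>
#include <iostream>
#include <set>

// Hypothetical L1d parameters, assumed for illustration only:
constexpr std::size_t LINE = 64;                    // cache line size in bytes
constexpr std::size_t SIZE = 32 * 1024;             // 32 KiB L1 data cache
constexpr std::size_t WAYS = 8;                     // 8-way set associative
constexpr std::size_t SETS = SIZE / (LINE * WAYS);  // = 64 sets

// How many distinct cache sets do the elements of one column of an n x n int matrix touch?
std::size_t column_sets(int n)
{
    std::set<std::size_t> used;
    for (int row = 0; row < n; ++row)
    {
        // Byte offset of A[row][0] from A[0][0]; a fixed column index only adds a
        // constant shift, which does not change how many sets the column occupies.
        std::size_t offset = static_cast<std::size_t>(row) * n * sizeof(int);
        used.insert((offset / LINE) % SETS);
    }
    return used.size();
}

int main()
{
    for (int n : {512, 513, 519, 530, 540})
        std::cout << "N = " << n << ": one column touches "
                  << column_sets(n) << " of " << SETS << " cache sets\n";
}

With these assumed parameters it reports 2 of 64 sets for N = 512 and all 64 sets for the other sizes: at 8 ways per set, only 16 column elements can stay resident at once, so the column side of each swap misses on almost every access, as Mysticial's linked answer describes.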

1 Answer

6

This is due to cache misses. You can use `valgrind --tool=cachegrind` to see the number of misses. Using N = 512, I got the following output:

Average for size 512: 13052 (us)==21803== 
==21803== I   refs:      1,054,721,935
==21803== I1  misses:            1,640
==21803== LLi misses:            1,550
==21803== I1  miss rate:          0.00%
==21803== LLi miss rate:          0.00%
==21803== 
==21803== D   refs:        524,278,606  (262,185,156 rd   + 262,093,450 wr)
==21803== D1  misses:      139,388,226  (139,369,492 rd   +      18,734 wr)
==21803== LLd misses:           25,828  (      7,959 rd   +      17,869 wr)
==21803== D1  miss rate:          26.6% (       53.2%     +         0.0%  )
==21803== LLd miss rate:           0.0% (        0.0%     +         0.0%  )
==21803== 
==21803== LL refs:         139,389,866  (139,371,132 rd   +      18,734 wr)
==21803== LL misses:            27,378  (      9,509 rd   +      17,869 wr)
==21803== LL miss rate:            0.0% (        0.0%     +         0.0%  )

Meanwhile, using N = 530, I got the following output:

Average for size 530: 13264 (us)==22783== 
==22783== I   refs:      1,129,929,859
==22783== I1  misses:            1,640
==22783== LLi misses:            1,550
==22783== I1  miss rate:          0.00%
==22783== LLi miss rate:          0.00%
==22783== 
==22783== D   refs:        561,773,362  (280,923,156 rd   + 280,850,206 wr)
==22783== D1  misses:       32,899,398  ( 32,879,492 rd   +      19,906 wr)
==22783== LLd misses:           26,999  (      7,958 rd   +      19,041 wr)
==22783== D1  miss rate:           5.9% (       11.7%     +         0.0%  )
==22783== LLd miss rate:           0.0% (        0.0%     +         0.0%  )
==22783== 
==22783== LL refs:          32,901,038  ( 32,881,132 rd   +      19,906 wr)
==22783== LL misses:            28,549  (      9,508 rd   +      19,041 wr)
==22783== LL miss rate:            0.0% (        0.0%     +         0.0%  )

As you can see, the number of D1 misses for N = 512 is roughly four times that for N = 530 (139,388,226 vs. 32,899,398), which is what makes N = 512 so much slower.

Amadeus
  • So one solution would be to use the next larger size matrix for a "cache friendly" matrix, leaving some unused columns (and possibly rows), but it should be faster. – rcgldr Mar 02 '17 at 21:05
  • Yup, and the high rate of misses is because associativity allows only a fraction of the total cache to be used, under certain memory access patterns. – Ben Voigt Mar 02 '17 at 21:12
  • 1
    @rcgldr: A better solution would be to change the memory access order. Instead of swapping single elements, swap a 4x4 block, that way you access all the elements in the same cache line for both ends of the swap. – Ben Voigt Mar 02 '17 at 21:14
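
Following Ben Voigt's blocking suggestion just above (and BeyelerStudios's timings in the question comments, roughly 1500 μs → 400 μs for N = 512 with 4-int blocks), here is a minimal sketch of a blocked in-place transpose. It is not code from the thread: the BLOCK constant is an illustrative choice, and the sketch assumes N is a multiple of BLOCK.

#include <algorithm> // std::min
#include <utility>   // std::swap

constexpr int N     = 512;
constexpr int BLOCK = 4;   // illustrative tile size: 4 ints = 16 bytes, a quarter of a 64-byte line

int A[N][N];

// Swap the lower triangle with the upper triangle tile by tile, so the
// column-side accesses reuse the same few cache lines several times before
// they can be evicted. Assumes N is a multiple of BLOCK.
void transpose_blocked()
{
    for (int ib = 0; ib < N; ib += BLOCK)
    for (int jb = 0; jb <= ib; jb += BLOCK)
        for (int i = ib; i < ib + BLOCK; ++i)
        for (int j = jb; j < std::min(i, jb + BLOCK); ++j)
            std::swap(A[i][j], A[j][i]);
}

int main()
{
    for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
        A[i][j] = i + j;

    transpose_blocked();   // drop-in replacement for transpose() in the question's benchmark
}

rcgldr's padding idea is the other usual fix: declaring the array as, say, int A[N][N + 1] leaves an unused column but changes the row stride from 2048 to 2052 bytes, so a column no longer keeps hitting the same two cache sets.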