
I am pressed for time optimizing a large piece of C code for speed, and I am looking for an algorithm (ideally a C "snippet") that transposes a rectangular source matrix u[r][c] of arbitrary size (r rows, c columns) into a target matrix v[s][d] (s = c rows, d = r columns) in a "cache-friendly", i.e. data-locality-respecting, way. The typical size of u is around 5000...15000 rows by 50 to 500 columns, and it is clear that during transposition one of the two matrices must be traversed column-wise, which is very cache-inefficient.
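For concreteness, the naive baseline (reconstructed here from the layout described above, not code from the original post) is just two nested loops, where the write side jumps to a different row of v on every inner iteration:

for (int i = 0; i < r; i++)
    for (int j = 0; j < c; j++)
        v[j][i] = u[i][j];  /* reads of u are sequential, writes to v stride across rows */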

There are many discussions on this topic on the web (among others, this thread), but as far as I can see all of them cover special cases such as square matrices, u[r][r], or matrices defined as one-dimensional arrays, e.g. u[r * c], not the above-mentioned "array of arrays" (of equal length) used in my context of Numerical Recipes (background see here).

I would be very thankful for any hint that helps spare me the "reinvention of the wheel".

Martin

  • If two source elements are close to each other, the corresponding target elements will be far away from each other, and vice versa. How can one possibly hope for a cache-friendly transposition? I'm genuinely curious. – n. m. could be an AI Nov 29 '15 at 16:08
  • Found [this](http://stackoverflow.com/questions/5200338/a-cache-efficient-matrix-transpose-program). I don't quite understand how it works... – n. m. could be an AI Nov 29 '15 at 16:13
  • This *question* is a request for resources or information. SO is not the place for that. Try an academic Q&A site such as Computer Science. – Ely Nov 29 '15 at 16:17
  • I actually tried a simple implementation of the cache-oblivious transpose, and got a 10x speed-up compared to the naïve method. – n. m. could be an AI Nov 29 '15 at 17:06
  • @n.m. The key rests on exploiting the fact that a cache line has a fixed size. Let's say we're dealing with `float`. Even when traversing the source matrix sequentially from top-left to bottom-right, there is a compulsory miss every 16 columns within a single row (assuming a 64-byte cache line). –  Nov 29 '15 at 17:14
  • @n.m. As a result, it's just as expensive (from the simplified L1-cache perspective) to access the 16th-column entry in a row as it is to access the first-column entry of the next row. We tend to think about spatial locality in terms of one giant contiguous block, but in the context of the L1 cache it only matters up to the size of a cache line. –  Nov 29 '15 at 17:16
  • @n.m. Now take this understanding and you begin to realize that you can start tiling your traversals across the matrix: traverse short horizontal spans of the input matrix, using all the data in a cache line before it is evicted, while achieving temporal locality by likewise shortening the vertical spans traversed in the output matrix (see the sketch after these comments). Fewer cache lines are then required for the tiled vertical strides through the output matrix, while the shorter tiled horizontal spans make things no worse for the input. –  Nov 29 '15 at 17:23
  • @Ike yes, got it now, pretty obvious in hindsight. – n. m. could be an AI Nov 29 '15 at 17:25
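To make the tiling idea from these comments concrete, here is a minimal sketch for the array-of-arrays layout from the question (the function name and the tile size BLK are illustrative assumptions, not code from this thread):

#include <stddef.h>

#define BLK 64  /* tile edge in elements; tune per machine, as discussed below */

/* u: r x c source (array of row pointers), v: c x r target */
void transpose_tiled(float **v, float **u, size_t r, size_t c)
{
    for (size_t ib = 0; ib < r; ib += BLK)
        for (size_t jb = 0; jb < c; jb += BLK) {
            size_t imax = ib + BLK < r ? ib + BLK : r;
            size_t jmax = jb + BLK < c ? jb + BLK : c;
            /* each BLK x BLK tile reuses the cache lines of both u and v
               before they are evicted */
            for (size_t i = ib; i < imax; i++)
                for (size_t j = jb; j < jmax; j++)
                    v[j][i] = u[i][j];
        }
}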

2 Answers


I do not think that an array of arrays is much harder to transpose than a linear array in general. But if you are going to have only 50 columns in each array, that sounds bad: it may not be enough to hide the overhead of the pointer dereference per row.

I think the overall strategy of a cache-friendly implementation is the same: process your matrix in tiles, and choose the tile size that performs best in experiments.

template<int BLOCK>
void TransposeBlocked(Matrix &dst, const Matrix &src) {
    int r = dst.r, c = dst.c;
    assert(r == src.c && c == src.r);
    // walk the destination in BLOCK x BLOCK tiles
    for (int i = 0; i < r; i += BLOCK)
        for (int j = 0; j < c; j += BLOCK) {
            if (i + BLOCK <= r && j + BLOCK <= c)
                // interior tile: size is known at compile time
                ProcessFullBlock<BLOCK>(dst.data, src.data, i, j);
            else
                // ragged tile at the right/bottom border
                ProcessPartialBlock(dst.data, src.data, r, c, i, j, BLOCK);
        }
}

I have tried to optimize the case r = 10000, c = 500 (with the float type). On my local machine, 128 x 128 tiles give a speedup of about 2.5x. I also tried to use SSE to accelerate the transposition, but it does not change the timings significantly. I think that's because the problem is memory bound.

Here are the full timings (for 100 launches each) of the various implementations on a Core2 E4700 at 2.6 GHz:

Trivial: 6.111 sec
Blocked(4): 8.370 sec
Blocked(16): 3.934 sec
Blocked(64): 2.604 sec
Blocked(128): 2.441 sec
Blocked(256): 2.266 sec
BlockedSSE(16): 4.158 sec
BlockedSSE(64): 2.604 sec
BlockedSSE(128): 2.245 sec
BlockedSSE(256): 2.036 sec

Here is the full code used.

stgatilov

So, I'm guessing you have an array of arrays of floats/doubles. This setup is already very bad for cache performance. The reason is that with a 1-dimensional array the compiler can emit code that prefetches effectively and (with a recent compiler) can produce SIMD/vectorized code. With an array of pointers there's a dereference operation on each step, which makes prefetching much harder. Not to mention there aren't any guarantees on memory alignment.

If this is for an assignment and you have no choice but to write the code from scratch, I'd recommend looking at how CBLAS does it (note that you'll still need your array to be "flattened"). Otherwise, you're much better off using a highly optimized BLAS implementation like OpenBLAS. It has been optimized for nearly a decade and will produce the fastest code for your target processor (tuning for things like cache sizes and the vector instruction set).
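For illustration, OpenBLAS ships the BLAS-like extension ?omatcopy (originating in MKL; it is not part of reference BLAS), which performs exactly this out-of-place transpose. A minimal sketch, assuming a flattened row-major source:

#include <cblas.h>

/* b = a^T, where a is rows x cols in row-major order and b is filled
   as cols x rows; requires OpenBLAS (or MKL), not reference BLAS */
void transpose_blas(const float *a, float *b, int rows, int cols)
{
    cblas_somatcopy(CblasRowMajor, CblasTrans, rows, cols,
                    1.0f, a, cols,   /* source and its leading dimension */
                    b, rows);        /* target and its leading dimension */
}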

The tl;dr is that using an array of arrays will result in terrible performance no matter what. Flatten your arrays and make your code nice to read by using a #define to access elements of the array; a minimal sketch follows.
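A sketch of that advice (the names u, v, U, V, demo are illustrative, not from this answer): allocate each matrix as one contiguous block and hide the index arithmetic behind a #define, so the code stays readable while the compiler sees a flat, prefetchable array:

#include <stdlib.h>

void demo(size_t r, size_t c)
{
    float *u = malloc(r * c * sizeof *u);  /* r x c source, one contiguous block */
    float *v = malloc(r * c * sizeof *v);  /* c x r target, one contiguous block */

#define U(i, j) u[(i) * c + (j)]  /* element (i,j) of the source */
#define V(i, j) v[(i) * r + (j)]  /* element (i,j) of the target */

    /* naive transpose over the flat layout; tile it as in the other answer */
    for (size_t i = 0; i < r; i++)
        for (size_t j = 0; j < c; j++)
            V(j, i) = U(i, j);

#undef U
#undef V
    free(u);
    free(v);
}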

alfalfasprout