6

I have always thought that a multidimensional array stored as a single block, indexed once via multiplication, is faster than an array of arrays indexed through two pointer dereferences, due to better locality and less space overhead.

I ran a small test a while ago, and the result was quite surprising. At least my callgrind profiler reported that the same function using an array of arrays ran slightly faster.

I wonder whether I should change the definition of my matrix class to use an array of arrays internally. This class is used virtually everywhere in my simulation engine (not sure that's the right term), and I do want to find the best way to save a few seconds.

test_matrix has a cost of 350 200 020 and test_array_array has a cost of 325 200 016. The code was compiled with -O3 by clang++. All member functions are inlined according to the profiler.

#include <iostream>
#include <memory>

template<class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
    BasicArray() = default;
    BasicArray(std::size_t);
};

template<class T>
BasicArray<T>::BasicArray(std::size_t size)
: std::unique_ptr<T[]>(new T[size]) {}

template<class T>
class Matrix : public BasicArray<T> {
public:
    Matrix() = default;
    Matrix(std::size_t, std::size_t);
    T &operator()(std::size_t, std::size_t) const;
    std::size_t get_index(std::size_t, std::size_t) const;
    std::size_t get_size(std::size_t) const;

private:
    std::size_t sizes[2];
};

template<class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
: BasicArray<T>(i * j)
, sizes {i, j} {}

template<class T>
T &Matrix<T>::operator()(std::size_t i, std::size_t j) const {
    return (*this)[get_index(i, j)];
}

template<class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const {
    return i * get_size(2) + j;
}

template<class T>
std::size_t Matrix<T>::get_size(std::size_t d) const {
    return sizes[d - 1];
}

template<class T>
class Array : public BasicArray<T> {
public:
    Array() = default;
    Array(std::size_t);
    std::size_t get_size() const;

private:
    std::size_t size;
};

template<class T>
Array<T>::Array(std::size_t size)
: BasicArray<T>(size)
, size(size) {}

template<class T>
std::size_t Array<T>::get_size() const {
    return size;
}

static void __attribute__((noinline)) test_matrix(const Matrix<int> &m) {
    for (std::size_t i = 0; i < m.get_size(1); ++i) {
        for (std::size_t j = 0; j < m.get_size(2); ++j) {
            static_cast<volatile void>(m(i, j) = i + j);
        }
    }
}

static void __attribute__((noinline))
test_array_array(const Array<Array<int>> &aa) {
    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
            static_cast<volatile void>(aa[i][j] = i + j);
        }
    }
}

int main() {
    constexpr int N = 1000;
    Matrix<int> m(N, N);
    Array<Array<int>> aa(N);
    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        aa[i] = Array<int>(N);
    }
    test_matrix(m);
    test_array_array(aa);
}
  • You're doing what many people do: guessing what you should worry about. Rather, let the program tell you what you should focus on - it is probably something else entirely. [*Here's an example.*](http://stackoverflow.com/a/927773/23771) – Mike Dunlavey Aug 07 '15 at 20:06
  • Have you considered converting to raw C? If you really want to save time, my experience is that removing class- and method-resolution can be a huge time saver. –  Aug 07 '15 at 20:06
  • Maybe you want to benchmark that with an access pattern other than increasing indices in storage order, which allows an optimizing compiler to remove both the multiplications and the majority of indirections. (And what @MikeDunlavey said.) – rici Aug 07 '15 at 20:08
  • @MikeDunlavey Maybe I should explain why I'm doing this. What I meant by a simulation engine is a Go (board game) engine which produces a move by doing millions of Monte-Carlo-like simulations. It is a very heavy computation, and if it can produce a quality move in 9 seconds where before it used to take 10 seconds, I see that as a good improvement. –  Aug 07 '15 at 20:10
  • @JohnPirie It was originally C, and I still like C much better. I just decided to move to C++ for better flexibility in changing the program structure when needed after some massive re-writings. –  Aug 07 '15 at 20:12
  • @xiver77: Regardless, no amount of guessing what you should optimize can stand up to good diagnosis. If you are right, it will tell you. If it is something else, it will tell you. If it is something else and you fix it, then maybe array indexing is the next biggest problem, maybe later. To expect a mere 10% in a freshly-written program is selling yourself short, and there is a huge difference between guessing what the problem is and *knowing* what it really is. – Mike Dunlavey Aug 07 '15 at 20:23
  • @rici Thank you for your suggestion. I've just run a test with `rand` for array indexing paired with `srand(0)` at function entry to get the same indexing sequence. The result is not much different; `test_array_array` still runs slightly faster. –  Aug 07 '15 at 20:27
  • Direct measurements with `clock_gettime` show that the array version is actually about two times slower than the matrix version. – n. m. could be an AI Aug 07 '15 at 21:11
  • If you swap the last two lines of main, do you get different results? It may just be an issue with caching and memory fragmentation or something. Each test should be called multiple times within one program, alternating between the two types. – Aaron McDaid Aug 07 '15 at 21:25
  • @n.m. I did the same, but the results don't differ much from those of callgrind. –  Aug 07 '15 at 22:35
  • @AaronMcDaid Yes! I ran the same test 10 times in a for loop, and one extremely interesting fact is that in the first iteration the matrix version is always slow, but after that the matrix version does run faster as expected. I never thought the order could result in a speed difference. Very interesting to know. Thanks! –  Aug 07 '15 at 22:37
  • Which timer did you use? – n. m. could be an AI Aug 08 '15 at 10:16
  • @n.m. `std::chrono::high_resolution_clock` and also `clock_gettime` with `CLOCK_MONOTONIC` out of curiosity. No difference though. –  Aug 08 '15 at 10:25
  • What about CLOCK_PROCESS_CPUTIME_ID? – n. m. could be an AI Aug 08 '15 at 10:28
  • @n.m. I'll have to fix my timer class again for that. Is that likely to make a big difference? I made `N` big enough to let the program run for several seconds. –  Aug 08 '15 at 10:31

3 Answers

6

The performance of the two approaches is nearly the same because the inner-most loop can be optimized the same way in both cases and the computation is likely memory-bound. This means the overhead of the indirection is diluted in the rest of the computation, which takes most of the time and is subject to variations that can actually be bigger than the overhead itself. Thus the benchmark is not very sensitive to the difference between the two methods. Here is the assembly code of the inner-most loop (left side: matrix, right side: array of arrays):

.LBB0_17:                                            .LBB1_30:
        movdqa  xmm5, xmm1                                   movdqa  xmm5, xmm1
        paddq   xmm5, xmm4                                   paddq   xmm5, xmm4
        movdqa  xmm6, xmm0                                   movdqa  xmm6, xmm0
        paddq   xmm6, xmm4                                   paddq   xmm6, xmm4
        shufps  xmm5, xmm6, 136                              shufps  xmm5, xmm6, 136 
        movdqa  xmm6, xmm3                                   movdqa  xmm6, xmm3
        paddq   xmm6, xmm1                                   paddq   xmm6, xmm1
        movdqa  xmm7, xmm3                                   movdqa  xmm7, xmm3
        paddq   xmm7, xmm0                                   paddq   xmm7, xmm0
        shufps  xmm6, xmm7, 136                              shufps  xmm6, xmm7, 136 
        movups  xmmword ptr [rdi + 4*rbx - 48], xmm5         movups  xmmword ptr [rsi + 4*rcx], xmm5
        movups  xmmword ptr [rdi + 4*rbx - 32], xmm6         movups  xmmword ptr [rsi + 4*rcx + 16], xmm6
        movdqa  xmm5, xmm0                                   movdqa  xmm5, xmm0
        paddq   xmm5, xmm10                                  paddq   xmm5, xmm10
        movdqa  xmm6, xmm1                                   movdqa  xmm6, xmm1
        paddq   xmm6, xmm10                                  paddq   xmm6, xmm10
        movdqa  xmm7, xmm3                                   movdqa  xmm7, xmm3
        paddq   xmm7, xmm6                                   paddq   xmm7, xmm6
        paddq   xmm6, xmm4                                   paddq   xmm6, xmm4
        movdqa  xmm2, xmm3                                   movdqa  xmm2, xmm3
        paddq   xmm2, xmm5                                   paddq   xmm2, xmm5
        paddq   xmm5, xmm4                                   paddq   xmm5, xmm4
        shufps  xmm6, xmm5, 136                              shufps  xmm6, xmm5, 136 
        shufps  xmm7, xmm2, 136                              shufps  xmm7, xmm2, 136 
        movups  xmmword ptr [rdi + 4*rbx - 16], xmm6         movups  xmmword ptr [rsi + 4*rcx + 32], xmm6
        movups  xmmword ptr [rdi + 4*rbx], xmm7              movups  xmmword ptr [rsi + 4*rcx + 48], xmm7
        add     rbx, 16                                      add     rcx, 16
        paddq   xmm1, xmm11                                  paddq   xmm1, xmm11
        paddq   xmm0, xmm11                                  paddq   xmm0, xmm11
        add     rbp, 2                                       add     rax, 2
        jne     .LBB0_17                                     jne     .LBB1_30

As we can see, the loop contains basically the same instructions for the two methods. The order of the stores (movups) is not the same, but this should not impact the execution time (especially if the array is aligned in memory). The same applies to the different register names. The loop is vectorized using SIMD instructions (SSE) and unrolled 4 times, so it can be pretty fast (4 items can be computed per SIMD unit and 16 items per iteration). About 62 iterations are needed for the inner-most loop to complete.

That being said, in both cases the loop writes 4*1000*1000 = 3.81 MiB of data. This typically fits in the L3 cache on relatively recent processors (or goes to RAM on older ones). The L3/RAM throughput available to a single core is limited (far lower than that of the L1 or even the L2 cache), so one core will likely stall waiting for the memory hierarchy to be ready. As a result, the loops are not so fast, since they spend most of their time waiting for the memory hierarchy. Hardware prefetchers are pretty efficient on modern x86-64 processors, so they can prefetch data before a core actually requests it, especially for stores and when the written data is contiguous.

The array-of-arrays method is generally less efficient because each sub-array is not guaranteed to be allocated contiguously. Modern memory allocators typically use a bucket-based strategy to find memory blocks fitting the requested size. In a program like this benchmark, the requested memory can be contiguous (or very close to it) since all the arrays are allocated in a row and the bucket memory is generally not fragmented when a program starts. However, when the memory is fragmented, the arrays tend to be located in non-contiguous regions, causing an effect called memory diffusion. Memory diffusion makes it harder for prefetchers to be efficient, causing less efficient loads/stores. This is generally especially true for loads, but stores also cause loads here on most x86-64 processors (Intel processors or recent AMD ones) due to the write-allocate cache policy. Put shortly, this is one main reason why the array-of-arrays method is generally less efficient in applications. But it is not the only one: the other comes from the indirections.
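
One way to see how close the sub-arrays actually land is to print the gap between the starting addresses of neighbouring rows. This is only a small diagnostic sketch (not part of the question's benchmark) reusing the Array class from the question; the exact gaps depend entirely on the allocator and on prior heap activity.

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Print the distance in bytes between the start of row i-1 and row i.
// On a fresh heap the gaps are often constant (row size plus allocator
// bookkeeping), i.e. nearly contiguous; on a fragmented heap they need not be.
void print_subarray_gaps(const Array<Array<int>> &aa) {
    for (std::size_t i = 1; i < aa.get_size(); ++i) {
        auto prev = static_cast<long long>(
            reinterpret_cast<std::uintptr_t>(&aa[i - 1][0]));
        auto curr = static_cast<long long>(
            reinterpret_cast<std::uintptr_t>(&aa[i][0]));
        std::printf("gap between row %zu and row %zu: %lld bytes\n",
                    i - 1, i, curr - prev);
    }
}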

The overhead of the additional indirections is pretty small in this benchmark, mainly because of the memory-bound inner loop. The pointers of the sub-arrays are stored contiguously, so they can fit in the L1 cache and be efficiently prefetched. This means the indirections can be fast because they are unlikely to cause a cache miss. The indirections cause additional load instructions, but since most of the time is spent waiting for the L3 cache or the RAM, the overhead of such instructions is very small, if not negligible. Indeed, modern processors execute instructions in parallel and out of order, so the L1 accesses can be overlapped with L3/RAM loads/stores. For example, Intel processors have dedicated units for that: the Line Fill Buffers (between the L1 and L2 caches), the Super-Queue (between the L2 and L3 cache) and the Integrated Memory Controller (between the L3 and the RAM). Most operations are done kind of asynchronously. That being said, things start to become synchronous when cores stall waiting on incoming data or when buffers/queues are saturated.

The overhead of the indirections becomes visible with a smaller inner-most loop or if the 2D array is traversed non-contiguously. Indeed, if the inner-most loop only computes a few items, or if it is even replaced with a single statement, then the overhead of the indirections is much more visible. The processor cannot (easily) overlap the overhead, and the array-of-arrays method becomes slower than the matrix-based approach. Here is the result of this new benchmark. The gap between the two methods seems small, but one should keep in mind that the cache is hot during the benchmark, while it may not be in a real-world application. Having a cold cache benefits the matrix-based method, which needs less data to be loaded from the cache (no need to load the array of pointers).
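
To illustrate, a variant along these lines (only a sketch reusing the Matrix and Array classes from the question, not the exact benchmark behind the numbers above) shrinks the inner-most loop to a handful of items, so the per-row indirection is no longer amortized over a long contiguous run:

static void __attribute__((noinline)) test_matrix_short(const Matrix<int> &m) {
    for (std::size_t i = 0; i < m.get_size(1); ++i) {
        for (std::size_t j = 0; j < 4; ++j) {   // only a few items per row
            m(i, j) = i + j;
        }
    }
}

static void __attribute__((noinline))
test_array_array_short(const Array<Array<int>> &aa) {
    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        for (std::size_t j = 0; j < 4; ++j) {   // the aa[i] load is no longer amortized
            aa[i][j] = i + j;
        }
    }
}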

To understand why the gap is not so huge, we need to analyse the assembly code again. The full assembly code can be seen on Godbolt. Clang uses 3 different strategies to speed up the loop (SIMD, scalar+unrolling and scalar), but the unrolled one is the one that should actually be used in this case. Here is the hot loop for the matrix-based method:

.LBB0_27:
        mov     dword ptr [r12 + rdi], esi
        lea     ebx, [rsi + 1]
        mov     dword ptr [r12 + rdx], ebx
        lea     ebx, [rsi + 2]
        mov     dword ptr [r12 + rcx], ebx
        lea     ebx, [rsi + 3]
        mov     dword ptr [r12 + rax], ebx
        add     rsi, 4
        add     r12, r8
        cmp     rsi, r9
        jne     .LBB0_27

Here is the one for the array of arrays:

.LBB1_28:
        mov     rbp, qword ptr [rdi - 48]
        mov     dword ptr [rbp], edx
        mov     rbp, qword ptr [rdi - 32]
        lea     ebx, [rdx + 1]
        mov     dword ptr [rbp], ebx
        mov     rbp, qword ptr [rdi - 16]
        lea     ebx, [rdx + 2]
        mov     dword ptr [rbp], ebx
        mov     rbp, qword ptr [rdi]
        lea     ebx, [rdx + 3]
        mov     dword ptr [rbp], ebx
        add     rdx, 4
        add     rdi, 64
        cmp     rsi, rdx
        jne     .LBB1_28

At first glance, the second one seems clearly less efficient because there are far more instructions to execute. But as said previously, modern processors execute instructions in parallel. Thus, the instruction dependencies and especially the critical path play a significant role in the resulting performance (e.g. dependency chains), not to mention the saturation of the processor units and more specifically the saturation of the back-end ports of the computing cores. Since the performance of this loop is strongly dependent on the target architecture, we should consider a specific processor architecture in order to analyse how fast each method is in this case. Let's choose a relatively recent mainstream architecture: Intel Coffee Lake.

The first loop is clearly bound by the store instructions (mov dword ptr [...], ...) since there is only 1 store port on this architecture, while lea and add instructions can be executed on multiple ports (and the cmp+jne is cheap because it can be macro-fused and predicted). The loop should take 4 cycles per iteration unless it is bound by the memory hierarchy.

The second loop is more complex, but it is also bound by the store instructions mov dword ptr [rbp], edx. Indeed, Coffee Lake has two load ports, so 2 mov rbp, qword ptr [...] instructions can be executed per cycle; the same is true for the lea, which can also be executed on 2 ports; the add and cmp+jne are still cheap. The number of instructions is not big enough to saturate the front-end, so the ports are the bottleneck here. In the end, this loop also takes 4 cycles per iteration, assuming the memory hierarchy is not a problem. The thing is that the scheduling of the instructions is not always perfect in practice, so the dependencies on the load instructions can introduce significant latency if something goes wrong. Since there is higher pressure on the memory hierarchy, a cache miss would cause the second loop to stall for many cycles, as opposed to the first loop (which only does writes). Not to mention that a cache miss is more likely to happen in the second case, since there is an 8 KB buffer of pointers to keep in the L1 cache for this computation to be fast: loading items from the L2 takes a dozen cycles, and loading data from the L3 can cause some cache lines to be evicted. This is why the second loop is slightly slower in this new benchmark.

What if we use another processor? The result can be significantly different, especially since Ice Lake (Intel) and Zen 2 (AMD), as they have 2 store ports. Things are pretty difficult to analyse on such processors (since no single port may be the bottleneck, nor perhaps the back-end at all). This is especially true for Zen 2/Zen 3, which have 2 shared load/store ports and one port dedicated only to stores (meaning 2 loads + 1 store scheduled in 1 cycle, or 1 load + 2 stores, or no load + 3 stores). Thus, the best option is certainly to run practical benchmarks on such platforms while taking care to avoid benchmarking biases.

Note that the memory alignment of the sub-arrays is pretty critical too. With N=1024, the matrix-based method can be significantly slower. This is because the memory layout of the matrix-based method is likely to cause cache thrashing in this case, while the array-of-arrays-based method typically adds some padding preventing this issue in this very specific case. The thing is that the added padding is typically sizeof(size_t) for mainstream bucket-based allocators, so the issue just happens for another value of N and is not really prevented. In fact, for N=1022, the array-of-arrays-based method is significantly slower. This perfectly matches the above explanation since sizeof(size_t) = 2*sizeof(int) = 8 on the target machine (64-bit). Thus, both methods suffer from this issue, but it can easily be controlled with the matrix-based method by adding some padding, while it cannot easily be controlled with the array-of-arrays-based method because the implementation of the allocator is platform-dependent by default.
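
As a rough sketch of what "adding some padding" to the matrix-based method could look like (this is not part of the original Matrix class; pad_row_length is a hypothetical helper), one can round the row stride up so that consecutive rows are not exactly a large power of two apart:

#include <cstddef>

// Hypothetical helper: if the row stride in bytes is a multiple of 4 KiB
// (the power-of-two case that maps all rows onto the same cache sets, e.g.
// N = 1024 with 4-byte ints), add one cache line worth of padding elements.
inline std::size_t pad_row_length(std::size_t cols, std::size_t elem_size) {
    std::size_t stride = cols;
    if ((stride * elem_size) % 4096 == 0)
        stride += 64 / elem_size;   // one 64-byte cache line of padding
    return stride;
}

// Usage sketch: allocate rows * pad_row_length(cols, sizeof(T)) elements and
// index with the padded stride, i.e. data[i * padded_stride + j].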

Jérôme Richard
  • Can this explain why the timings are the same when optimizations are turned off? (see my non-answer). – alfC Dec 23 '22 at 19:46
  • When optimizations are turned off, the code seems to spend most of its time calling functions in both cases (which is horribly inefficient and quite big). For example, `get_size`, `operator[]` and `operator()` are called for each iteration. That being said, the overhead of the matrix class is not as big as that of the array-of-arrays, since the latter has to call more functions which are less efficient (the `operator[]` is called twice and its internal code is a bit less efficient, as it is supposed to be optimized). This is confirmed by a [quick run](https://quick-bench.com/q/1G7HNsA2QPMfrgOuykSKCoC2eIQ). – Jérôme Richard Dec 27 '22 at 22:48
  • Thus, to put it shortly: I cannot reproduce the fact that "*the timings are the same when optimizations are turned off*" and I actually proved the opposite. – Jérôme Richard Dec 27 '22 at 22:52
  • You are right, I misread the result. Matrix is more robust to (non-)optimization than array of arrays. – alfC Dec 28 '22 at 03:34
1

I haven't looked through your code in a lot of detail. Instead, I tested your implementations against a really simple wrapper around an std::vector, then added a little bit of timing code so I didn't have to run under a profiler to get a meaningful result. Oh, and I really didn't like the code taking a reference to const, then using a cast to void to allow the code to modify the matrix. I certainly can't imagine expecting people to do that in normal use.

The result looked like this:

#include <chrono>
#include <iomanip>
#include <iostream>
#include <memory>
#include <vector>

template <class T>
class BasicArray : public std::unique_ptr<T[]> {
public:
    BasicArray() = default;
    BasicArray(std::size_t);
};

template <class T>
BasicArray<T>::BasicArray(std::size_t size)
    : std::unique_ptr<T[]>(new T[size])
{
}

template <class T>
class Matrix : public BasicArray<T> {
public:
    Matrix() = default;
    Matrix(std::size_t, std::size_t);
    T& operator()(std::size_t, std::size_t) const;
    std::size_t get_index(std::size_t, std::size_t) const;
    std::size_t get_size(std::size_t) const;

private:
    std::size_t sizes[2];
};

template <class T>
Matrix<T>::Matrix(std::size_t i, std::size_t j)
    : BasicArray<T>(i * j)
    , sizes { i, j }
{
}

template <class T>
T& Matrix<T>::operator()(std::size_t i, std::size_t j) const
{
    return (*this)[get_index(i, j)];
}

template <class T>
std::size_t Matrix<T>::get_index(std::size_t i, std::size_t j) const
{
    return i * get_size(2) + j;
}

template <class T>
std::size_t Matrix<T>::get_size(std::size_t d) const
{
    return sizes[d - 1];
}

template <class T>
class Array : public BasicArray<T> {
public:
    Array() = default;
    Array(std::size_t);
    std::size_t get_size() const;

private:
    std::size_t size;
};

template <class T>
Array<T>::Array(std::size_t size)
    : BasicArray<T>(size)
    , size(size)
{
}

template <class T>
std::size_t Array<T>::get_size() const
{
    return size;
}

static void test_matrix(Matrix<int>& m)
{
    for (std::size_t i = 0; i < m.get_size(1); ++i) {
        for (std::size_t j = 0; j < m.get_size(2); ++j) {
            m(i, j) = i + j;
        }
    }
}

static void
test_array_array(Array<Array<int>>& aa)
{
    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        for (std::size_t j = 0; j < aa[0].get_size(); ++j) {
            aa[i][j] = i + j;
        }
    }
}

namespace JVC {
template <class T>
class matrix {
    std::vector<T> data;
    size_t cols;
    size_t rows;

public:
    matrix(size_t y, size_t x)
        : cols(x)
        , rows(y)
        , data(x * y)
    {
    }
    T& operator()(size_t y, size_t x)
    {
        return data[y * cols + x];
    }

    T operator()(size_t y, size_t x) const
    {
        return data[y * cols + x];
    }
    std::size_t get_rows() const { return rows; }
    std::size_t get_cols() const { return cols; }
};

static void test_matrix(matrix<int>& m)
{
    for (std::size_t i = 0; i < m.get_rows(); ++i) {
        for (std::size_t j = 0; j < m.get_cols(); ++j) {
            m(i, j) = i + j;
        }
    }
}
}

template <class F, class C>
void do_test(F f, C &c, std::string const &label) {
    using namespace std::chrono;

    auto start = high_resolution_clock::now();
    f(c);
    auto stop = high_resolution_clock::now();

    std::cout << std::setw(20) << label << " time: ";
    std::cout << duration_cast<milliseconds>(stop - start).count() << " ms\n";
}

int main()
{
    std::cout.imbue(std::locale(""));

    constexpr int N = 20000;
    Matrix<int> m(N, N);
    Array<Array<int>> aa(N);
    JVC::matrix<int> m2 { N, N };

    for (std::size_t i = 0; i < aa.get_size(); ++i) {
        aa[i] = Array<int>(N);
    }
    using namespace std::chrono;

    do_test(test_matrix, m, "Matrix");
    do_test(test_array_array, aa, "array of arrays");
    do_test(JVC::test_matrix, m2, "JVC Matrix");
}

And the result looked like this:

              Matrix time: 1,893 ms
     array of arrays time: 1,812 ms
          JVC Matrix time: 620 ms

So, a trivial wrapper around std::vector is faster than either of your implementations by a factor of about 3.

I would suggest that with this much overhead, it's difficult to be at all certain the timing difference you're seeing stems from storage layout.

Jerry Coffin
  • The benchmark is **biased**. The compiler can inline the function and completely optimize out a function. Clang actually does that on my machine (and apparently on Godbolt too). This results in a "Matrix time: 0 ms", so it is technically the fastest ;) . Besides, the benchmark only computes things once, so it is subject to variation due to the allocator, and the order of the operations drastically impacts the result. The JVC time is only fast if it is the last benchmark launched on my machine. If I change the order of the `do_test` calls, it is not faster anymore. This is certainly due to page faults. – Jérôme Richard Dec 23 '22 at 02:54
0

To my surprise, your tests are basically correct. They also go against the conventional wisdom (see Dynamic Arrays in C—The Wrong Way).

I corroborated the result with Quickbench and the two timings are almost the same. https://quick-bench.com/q/FhhJTV8IdIym0rUMkbUxvgnXPeA

The only explanation I can offer is that, since your code is so regular, the compiler figures out that you are asking for consecutive equal-sized allocations, which can be replaced by a single block, and in turn the hardware can later predict the access pattern.

However, I tried making N volatile and inserting a bunch of randomly interleaved allocations at initialization and still got the same result. I even lowered the optimization to -Og and raised it up to -Ofast, and increased N, and I am still getting the same result.

It was only when I used benchmark::ClobberMemory that I saw a very small but appreciable difference (with clang, but not with GCC). So it could have to do with the memory access pattern.
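
For reference, the pattern I mean is roughly the following (a minimal sketch of a Google Benchmark fixture, not the exact code behind the quick-bench link; it reuses the Matrix class from the question). DoNotOptimize keeps the data "used" and ClobberMemory forces the stores to be treated as observable before the timer stops.

#include <benchmark/benchmark.h>
#include <cstddef>

static void BM_matrix(benchmark::State &state) {
    const std::size_t n = state.range(0);
    Matrix<int> m(n, n);
    for (auto _ : state) {
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                m(i, j) = i + j;
        benchmark::DoNotOptimize(&m(0, 0));  // keep the result reachable
        benchmark::ClobberMemory();          // make the writes observable
    }
}
BENCHMARK(BM_matrix)->Arg(1000);
BENCHMARK_MAIN();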


Another thing that made a (small) difference, but is important in real applications, was to include the initialization step inside the timing; still surprisingly, the difference was only between 5 and 10% (in favor of the single-block array).

Conclusion: The compiler, or most likely the hardware, must be doing something really amazing. The fact that the pointer-indirection version is never really faster than the block array makes me think that something is effectively reducing one case to the other. This deserves more research. Here is the machine code if someone is interested: https://godbolt.org/z/ssGj7aq7j

Afterthought: Before abandoning contiguous arrays, I would at least remain suspicious that this result could be an oddity of 2 dimensions and might not hold for structures of 3 or 4 dimensions.

Disclaimer: This is interesting to me because I am implementing a multidimensional array library and I care about performance. The library is a generalization of your Matrix class to arbitrary dimensions: https://gitlab.com/correaa/boost-multi.

alfC