4

I've been searching the web (and stackoverflow) for opinions on whether or not 1-dimensional arrays (or vectors) are faster than their 2-dimensional counterparts. And the general conclusion seems to be that 1-dimensional are the fastest. However, I wrote a short test program to see for myself, and it shows that 2-dimensional are the best. Can anyone find a bug in my test, or at least explain why I get this result?

I use it for storing matrices, and thus need to index the 1-dimensional arrays with both row and column.

#include <iostream>
#include <chrono>
#include <vector>

uint64_t timestamp()
{
    namespace sc = std::chrono;
    static auto start = sc::high_resolution_clock::now();
    return sc::duration_cast<sc::duration<uint64_t, std::micro>>(sc::high_resolution_clock::now() - start).count();
}

int main(int argc, char** argv)
{
    if (argc < 3)
        return 0;
    size_t size = atoi(argv[1]);
    size_t repeat = atoi(argv[2]);

    int** d2 = (int**)malloc(size*sizeof(int*));
    for (size_t i = 0; i < size; ++i)
        d2[i] = (int*)malloc(size*sizeof(int));

    int* d1 = (int*)malloc(size*size*sizeof(int));

    std::vector<std::vector<int> > d2v(size);
    for (auto& i : d2v)
        i.resize(size);

    std::vector<int> d1v(size*size);

    uint64_t start, end;
    timestamp();

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t r = 0; r < size; ++r)
        {
            for (size_t c = 0; c < size; ++c)
            {
                if (r == 0)
                    d2[r][c] = 0;
                else
                    d2[r][c] = d2[r-1][c] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "2D array\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t c = 0; c < size; ++c)
        {
            for (size_t r = 0; r < size; ++r)
            {
                if (r == 0)
                    d2[r][c] = 0;
                else
                    d2[r][c] = d2[r-1][c] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "2D array C\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t r = 0; r < size; ++r)
        {
            for (size_t c = 0; c < size; ++c)
            {
                if (r == 0)
                    d1[r + c*size] = 0;
                else
                    d1[r + c*size] = d1[r-1 + c*size] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "1D array\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t c = 0; c < size; ++c)
        {
            for (size_t r = 0; r < size; ++r)
            {
                if (r == 0)
                    d1[r + c*size] = 0;
                else
                    d1[r + c*size] = d1[r-1 + c*size] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "1D array C\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t r = 0; r < size; ++r)
        {
            for (size_t c = 0; c < size; ++c)
            {
                if (r == 0)
                    d2v[r][c] = 0;
                else
                    d2v[r][c] = d2v[r-1][c] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "2D vector\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t c = 0; c < size; ++c)
        {
            for (size_t r = 0; r < size; ++r)
            {
                if (r == 0)
                    d2v[r][c] = 0;
                else
                    d2v[r][c] = d2v[r-1][c] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "2D vector C\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t r = 0; r < size; ++r)
        {
            for (size_t c = 0; c < size; ++c)
            {
                if (r == 0)
                    d1v[r + c*size] = 0;
                else
                    d1v[r + c*size] = d1v[r-1 + c*size] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "1D vector\t" << size << "\t" << end - start << std::endl;

    start = timestamp();
    for (size_t n = 0; n < repeat; ++n)
    {
        for (size_t c = 0; c < size; ++c)
        {
            for (size_t r = 0; r < size; ++r)
            {
                if (r == 0)
                    d1v[r + c*size] = 0;
                else
                    d1v[r + c*size] = d1v[r-1 + c*size] + 1;
            }
        }
    }
    end = timestamp();
    std::cout << "1D vector C\t" << size << "\t" << end - start << std::endl;

    return 0;
}

I get the following output:

user@user-debian64:~/matrix$ ./build/test/index_test 1000 100
2D array    1000    79593
2D array C  1000    326695
1D array    1000    440695
1D array C  1000    262251
2D vector   1000    73648
2D vector C 1000    418287
1D vector   1000    371433
1D vector C 1000    269355
user@user-debian64:~/matrix$ ./build/test/index_test 10000 1
2D array    10000   149748
2D array C  10000   3507346
1D array    10000   2754570
1D array C  10000   257997
2D vector   10000   92041
2D vector C 10000   3791745
1D vector   10000   3384403
1D vector C 10000   266811
Jonas
  • 6,915
  • 8
  • 35
  • 53
  • Keywords: Locality, cache. – Seyf Nov 17 '14 at 14:04
  • Yes, but that seems to be the reason people say 1-dimensional are fastest. – Jonas Nov 17 '14 at 14:06
  • Oh, just checked the code. You are doing way wrong while iterating through 1D arrays. See my answers in minutes. – Seyf Nov 17 '14 at 14:10
  • 1
    FWIW, here are the results on Arch Linux 64-bit on an Intel Core i7, using Intel, Clang, and GCC compilers: http://pastebin.com/MfsQzf5N – rubenvb Nov 17 '14 at 14:23
  • @rubenvb Did you use optimisation flags? – Jonas Nov 17 '14 at 14:51
  • @Jonas: `-march=x86-64 -mtune=generic -O2` for Clang/GCC, `-O2 -falign-functions=16` for Intel. – rubenvb Nov 17 '14 at 14:53
  • @rubenvb Would `-ftree-vectorize` or `-O3` not make the GCC version faster? I usually use `-O3`. – Jonas Nov 17 '14 at 14:54
  • @Jonas: Adding both options makes up to ~14% difference for the "2D array" case and less or negligible for the remaining tests. `-O3` isn't always faster though, and adds stuff like `-fast-math` which may be unwanted. Don't just use `-O3`. You might as well [optimize for size](http://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed) then instead. `-O2` should be a sane default. Measure if you need to know. – rubenvb Nov 17 '14 at 15:02
  • @rubenvb Thanks, I've been running on a virtual machine and really appreciate those results. According to https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html `-O3` does not include `-ffast-math`. However, `-O2 -ftree-loop-vectorize -finline-functions` might be all I need. – Jonas Nov 17 '14 at 19:40
  • @Jonas I stand corrected. – rubenvb Nov 17 '14 at 20:42

2 Answers2

2

The way you are iterating through the 1D array is wrong. You don't need a nested loop in a 1D array. It not only is unnecessary, but brings extra math work to calcualte the index. Instead of this part,

for (size_t c = 0; c < size; ++c)
{
    for (size_t r = 0; r < size; ++r)
    {
        if (r == 0)
            d1[r + c*size] = 0;
        else
            d1[r + c*size] = d1[r-1 + c*size] + 1;
    }
}

you should write

for (size_t r = 0; r < size*size; ++r)
{
    if (r == 0)
        d1[r] = 0;
    else
        d1[r] = d1[r-1] + 1;
}

and it will be fine.

Seyf
  • 889
  • 8
  • 16
  • My use case is storage for matrices, therefor I use rows and cols to fill the matrix with integers between `0` and `size-1`. So I need to be able to index via rows and cols. – Jonas Nov 17 '14 at 14:45
  • 2
    This gives different results. – Peter Nov 17 '14 at 14:46
2

The root of the problem is that your storage order is different between the two schemes.

Your 2D structures are stored row-major. By dereferencing the row first, you arrive at a single buffer which can be directly indexed by column. Neighboring columns are in adjacent memory locations.

Your 1D structures are stored column-major. Neighboring columns are size elements apart in memory.

Trying both orders of iteration covers almost all of the effect. But what's left is the data dependence. By referring to D(r-1,c), the access patterns are completely different between row- and column- major.

Sure enough, changing the 1D indexing to d1[r*size + c] and d1[(r-1)*size + c] produces the following timing:

2D array    1000    78099
2D array C  1000    878527
1D array    1000    19661
1D array C  1000    729280
2D vector   1000    61641
2D vector C 1000    741249
1D vector   1000    18348
1D vector C 1000    726231

So, we still have to explain it. I'm going with the "loop-carried dependency". When you iterated the column-major 1D array in column-major order (good idea), each element depended on the element computed in the previous iteration. That means the loop can't be fully pipelined, as the result has to be fully computed and written back to cache, before it can be read again to compute the next element. In row-major, the dependence is now an element that was computed long ago, which means the loop can be unrolled and pipelined.

Peter
  • 14,559
  • 35
  • 55
  • Did you use any optimisation flags when compiling? When I use `-O3` the 1D & 2D vector has almost the same speed as 1D array. – Jonas Nov 18 '14 at 07:38
  • 1
    Yes, these timings are with GCC 4.8 using -O3. A factor of at least 50% persists across different optimization settings, EXCEPT when I use -O3 and DISABLE vectorization. In this case, the 1D array is only slightly faster than the 2D. So auto-vectorization is playing into these timings, as well. – Peter Nov 18 '14 at 14:20