1

A long time ago, inspired by "Numerical Recipes in C", I started using the following construct for storing matrices (2D arrays).

double **allocate_matrix(int NumRows, int NumCol)
{
  double **x;
  int i;

  x = (double **)malloc(NumRows * sizeof(double *));
  for (i = 0; i < NumRows; ++i) x[i] = (double *)calloc(NumCol, sizeof(double));
  return x;
}

double **x = allocate_matrix(1000,2000);
x[m][n] = ...;

But recently noticed that many people implement matrices as follows

double *x = (double *)malloc(NumRows * NumCols * sizeof(double));
x[NumCols * m + n] = ...;

From the locality point of view the second method seems perfect, but it has awful readability... So I started to wonder: is my first method, with its auxiliary array of `double *` pointers, really bad, or will the compiler eventually optimize it to be more or less equivalent in performance to the second method? I am suspicious because I think the first method makes two jumps when accessing a value, `x[m]` and then `x[m][n]`, so there is a chance that each time the CPU will first load the `x` array and then the `x[m]` array.

P.S. Do not worry about the extra memory for storing the `double *` pointers; for large matrices it is just a small percentage.

P.P.S. Since many people did not understand my question very well, I will try to rephrase it: do I understand correctly that the first method is a kind of locality hell, where each time `x[m][n]` is accessed the `x` array is first loaded into the CPU cache and then the `x[m]` array is loaded, thus making each access run at RAM speed? Or am I wrong, and the first method is also OK from a data-locality point of view?

John Smith
  • 1,027
  • 15
  • 31
  • "From the locality point of view the second method seems perfect, but has awful readability... " I would say both are awful, as they both use pointers and `malloc` IMO. – Rakete1111 May 17 '17 at 16:21
  • 1
    @John Smith It is a C approach to allocate arrays. And the compiler can not optimize the memory allocation because it is performed at run-time. – Vlad from Moscow May 17 '17 at 16:24
  • @John Smith And "Numerical recipes in C", is not the same as "Numerical recipes in C++":), – Vlad from Moscow May 17 '17 at 16:26
  • In C++ you use vector for "re-dimensionable" array, not pointers. – nefas May 17 '17 at 16:29
  • 3
    @John Smith it is definitely worth it to ensure the 2d array is allocated as a contiguous block. As an answer points out you can use a wrapper to do this while maintaining your "pretty" interface. But what would be even more beneficial in the long run is to use a good matrix library as it will do all of this for you plus all kinds of other optimizations (expression templates, SIMD, etc). – RyanP May 17 '17 at 16:30
  • 1
    If this is a practical question rather than academic, do what @RyanP suggests. Download Eigen. Then your code becomes `Eigen::MatrixXd x(NumRows, NumCols);` and `x(m, n) = ...;` – Peter May 17 '17 at 16:50
  • 1
RyanP, I never understood why I need a good library for doing simple things. Will they do `x[m][n]` access using some super-secret method? Some people say "you have to use STL vectors", but they probably do not remember that a few years ago the Microsoft implementation of vectors was awful, and `vector<vector<double>>` was many times slower than a simple array... – John Smith May 17 '17 at 16:54

5 Answers

3

For C-style allocations you can actually have the best of both worlds:

double **allocate_matrix(int NumRows, int NumCols)
{
  double **x;
  int i;

  x = (double **)malloc(NumRows * sizeof(double *));
  x[0] = (double *)calloc(NumRows * NumCols, sizeof(double)); // <<< single contiguous memory allocation for entire array
  for (i = 1; i < NumRows; ++i) x[i] = x[i - 1] + NumCols;
  return x;
}

This way you get data locality and its associated cache/memory access benefits, and you can treat the array as a double ** or as a flattened 2D array (array[i * NumCols + j]) interchangeably. You also make fewer allocation/free calls (2 versus NumRows + 1).

Paul R
  • 208,748
  • 37
  • 389
  • 560
  • but will this method have a lot of cache misses or not? – John Smith May 17 '17 at 17:03
  • No, that's the whole point of using a single large contiguous allocation for the underlying data. – Paul R May 17 '17 at 17:08
  • then may be I misunderstand the way the CPU works? I thought that in this method when you access `x[m][n]` first `x[m]` will be read. In order to read `x[m]` the CPU will download part of the `double **x` array into the cache. Then it will read the value of `x[m]` and will retrieve the actual data, which is of course different part of the memory, thus making CPU to reload the cache line. Is my logic somehow deficient? – John Smith May 17 '17 at 17:16
  • If you're iterating through the array sequentially then you only need to access the row pointer x[m] once per row. You also have the option to treat the array as flattened (see above), which might be more efficient for a more random access pattern (although such an access pattern is very cache inefficient to begin with of course). – Paul R May 17 '17 at 17:46
  • but I do not want to access `x[m]` and store it each time I am accessing rows. what happens if I have cycle `for (int m=0; m – John Smith May 17 '17 at 18:00
As I already said: you don't have to - there is no "store" anyway, you just load the row pointer and add an offset, which is typically all done in one instruction (on x86 at least). You also have the option to use direct addressing (see above) while still retaining compatibility with legacy APIs. – Paul R May 17 '17 at 18:03
1

No need to guess whether the compiler will optimize the first method. Just use the second method which you know is fast, and use a wrapper class that implements for example these methods:

double& operator()(int x, int y);
double const& operator()(int x, int y) const;

... and access your objects like this:

arr(2, 3) = 5;

Alternatively, if you can bear a little more code complexity in the wrapper class(es), you can implement a class that can be accessed with the more traditional arr[2][3] = 5; syntax. This is implemented in a dimension-agnostic way in the Boost.MultiArray library, but you can do your own simple implementation too, using a proxy class.

Note: Considering your usage of C style (a hardcoded non-generic "double" type, plain pointers, function-beginning variable declarations, and malloc), you will probably need to get more into C++ constructs before you can implement either of the options I mentioned.

Stefan Monov
  • 11,332
  • 10
  • 63
  • 120
  • I do not see a huge problem in using malloc. I am pretty sure that STL vectors use mallocs inside. But anyway, my main question was: will the first method be hugely inefficient because of cache misses when accessing `x[m][n]` or it is more or less the same in that sense as the second method? I am not worried here about the array being not completely contiguous, since I can easily make it contiguous (see for example the answer by Paul R). What I am worried about is having extra array of `double *` and 2 redirections: first from `x` to `x[m]` and then to `x[m][n]` – John Smith May 17 '17 at 17:11
1

The two methods are quite different.

  • While the first method allows convenient two-subscript access to the values at the cost of another level of indirection (the double ** array, hence you need 1 + NumRows allocations), ...
  • the second method guarantees that ALL values are stored contiguously and requires only one malloc.

I would argue that the second method is always superior. `malloc` is an expensive operation, and contiguous memory is a huge plus, depending on the application.

In C++, you'd just implement it like this:

std::vector<double> matrix(NumRows * NumCols);
matrix[y * numCols + x] = value;  // Access

and if you're concerned with the inconvenience of having to compute the index yourself, add a wrapper that implements operator()(int x, int y) to it.

You are also right that the first method is more expensive when accessing the values, because you need two memory lookups, as you described: x[m] and then x[m][n]. The compiler will not "optimize this away". The pointer array, depending on its size, will likely stay cached, so the performance hit may not be that bad. In the second case, you instead pay an extra multiplication for direct access.

crazypeter
  • 1,279
  • 1
  • 10
  • 14
0

In the first method you use, the double* in the master array point to logical rows (arrays of NumCol elements each).

So, if you write something like below, you get the benefits of data locality in some sense (pseudocode):

foreach(row in rows):
    foreach(elem in row):
        //Do something

If you tried the same thing with the second method, with element access done the way you specified (i.e. x[NumCol * m + n]), you would get the same benefit, because you are treating the array as row-major. If instead you accessed the elements in column-major order with the same pseudocode, I assume you would get cache misses once the array is large enough.

In addition to this, the second method has the additional desirable property of being a single contiguous block of memory which further improves the performance even when you loop through multiple rows (unlike the first method).

So, in conclusion, the second method should be much better in terms of performance.

nakiya
  • 14,063
  • 21
  • 79
  • 118
0

If NumCol is a compile-time constant, or if you are using GCC with language extensions enabled, then you can do:

double (*x)[NumCol] = (double (*)[NumCol]) malloc(NumRows * sizeof (double[NumCol]));

and then use x as a 2D array, letting the compiler do the indexing arithmetic for you. The caveat is that unless NumCol is a compile-time constant, ISO C++ won't let you do this, and if you rely on GCC's variable-length array extension you won't be able to port your code to other compilers.

Klitos Kyriacou
  • 10,634
  • 2
  • 38
  • 70