
The following code is from pg. 93 of Parallel and High Performance Computing and is a single contiguous memory allocation for a 2D array:

double **malloc_2D(int nrows, int ncols) {

  double **x = (double **)malloc(
      nrows*sizeof(double*) 
      + nrows*ncols*sizeof(double));  // L1

  x[0] = (double *)x + nrows;         // L2  
                                                 
  for (int j = 1; j < nrows; j++) {   // L3                                              
    x[j] = x[j-1] + ncols;
  }

  return x;
}

The book states that this improves memory allocation and cache efficiency. Is there any reason w.r.t efficiency to prefer the first code to something like the below code? It seems like the below code is more readable, and it's also easily usable with MPI (I only mention this because the book also covers MPI later).

double *malloc_2D(int nrows, int ncols) {
  double *M = (double *)malloc(nrows * ncols * sizeof(double));
  return M;
}

I include the below image to make sure that my mental model of the first code is correct; if it is not, please mention that in the answer. The image shows the result of calling the first function to create a 5 x 2 matrix. Note that I write the indices 0 through 14 in the boxes only for clarity; of course the values stored at these memory locations will not be 0 through 14. Also note that L# refers to the commented lines in the first code.

[Image: memory layout after calling malloc_2D(5, 2) — the five row pointers x[0] through x[4] at the start of the block, followed by the ten contiguous double elements]

Jared Frazier
  • After the first code you can write `arr[i][j]`, in the second code you have to write `arr[i * ncols + j]` every time. – yeputons Dec 25 '22 at 23:03
  • But wouldn't you have to do the `arr[i * ncols + j]` style indexing anyways if you were to use something like MPI? Because MPI primitives expect a pointer to a contiguous chunk of memory to be sent and received, so even if you started with something that could be nicely indexed like with the first function, in child processes you end up being forced to do the `i*ncols+j` indexing, right? In any case, maybe there are performance gains with `arr[i][j]`? It doesn't seem like there would be, but idk. The authors surely mentioned this style of array for a reason. But it's not obvious to me why. – Jared Frazier Dec 25 '22 at 23:07
  • 4
    You can take the best of both worlds (contiguous allocation and nice indexing) by using a pointer to VLA. Just do: `double (*arr)[ncols] = calloc(nrows, sizeof *arr);`. – tstanisl Dec 25 '22 at 23:36
  • @tstanisl IMO the best way. Avoiding two levels of indirection – 0___________ Dec 26 '22 at 00:15
  • See [**Correctly allocating multi-dimensional arrays**](https://stackoverflow.com/questions/42094465/correctly-allocating-multi-dimensional-arrays) – Andrew Henle Jan 01 '23 at 13:49

1 Answer


The book states that this improves memory allocation and cache efficiency.

The book’s code improves efficiency relative to a too-often seen method of allocating pointers separately, as in:

double **x = malloc(nrows * sizeof *x);
for (size_t i = 0; i < nrows; ++i)
    x[i] = malloc(ncols * sizeof *x[i]);

(Note that all methods should test the malloc result and handle allocation failures. This is elided for the discussion here.)

That method allocates each row separately (from the other rows and from the row pointers). The book’s method has the benefit that only one allocation is done and that the memory for the array is contiguous. Also, the relationships between elements in different rows are known, which may allow programmers to take advantage of those relationships in designing algorithms that work well with cache and memory access.

Is there any reason w.r.t efficiency to prefer the first code to something like the below code?

Not for efficiency, no. Both the book’s method and the method above have the disadvantage that they generally require a pointer lookup for every array access (aside from the base pointer, x). Before the processor can get an element from the memory of a row, it has to get the address of the row from memory.

With the method you show, this additional lookup is unnecessary. Further, the processor and/or the compiler may be able to predict some things about the accesses. For example, with your method, the compiler may be able to see that M[(i+1)*ncols + j] is a different element from M[(i+2)*ncols + j], whereas with x[i+1][j] and x[i+2][j], it generally cannot know the two pointers x[i+1] and x[i+2] are different.

The book’s code is also defective. The number of bytes it allocates is nrows*sizeof(double*) + nrows*ncols*sizeof(double). Let’s say r is nrows, c is ncols, p is sizeof(double*), and d is sizeof(double); then the code allocates rp + rcd bytes.

The code then sets x[0] to (double *)x + nrows. Because of the cast to double *, the addition of nrows is done in units of the pointed-to type, double, so it adds rd bytes to the starting address, not the rp bytes the row pointers actually occupy. The array elements occupy the rcd bytes after that, so the code uses memory up to offset rd + rcd even though it allocated rp + rcd. If d > p (as on common 32-bit implementations, where pointers are four bytes and double is eight), elements at the end of the array lie outside the allocated memory; if p > d, the elements overlap the row pointers. The code is correct only when p = d, which happens to hold on current ordinary 64-bit C implementations, but it should not be relied on. Instead of setting x[0] to (double *)x + nrows, the code should advance x by the size of nrows elements of type double * plus enough padding to reach the alignment requirement of double, and it should include that padding in the allocation.

If we cannot use variable length arrays, the array indexing can be provided by a macro that replaces x(i, j) with x[i*ncols + j], such as #define x(i, j) x[(i)*ncols + (j)].

Eric Postpischil
  • All good. But whether a pointer dereference is slower than the multiply needed for the rectangular block allocation is very architecture dependent. Certainly in an embedded environment where multiply might even be in software, the pointer can be much faster. – Gene Dec 26 '22 at 02:49
  • This is a very thorough answer. The only part I'm a bit confused about is the "book's code is defective part." I think the discussion on byte alignment is similar to [this stackoverflow question](https://stackoverflow.com/questions/2846914/what-is-meant-by-memory-is-8-bytes-aligned); however I still don't really understand it in the context of the current code. I'm concerned that my mental model (as shown in the figure I included) is wrong because I didn't catch this issue, and I'd like to correct my understanding if possible. Would you explain this a little more, please? – Jared Frazier Dec 26 '22 at 12:19
  • Also, w.r.t to defining a macro such that `x(i,j)` is replaced with `x[i*ncols+j]`, how could you do that in C? I've done this style of indexing before by just defining a function like ```size_t ix(size_t i, size_t j, size_t ncols) { return i*ncols+j; }``` and then `x[ix(i, j, ncols)]`. – Jared Frazier Dec 26 '22 at 12:45
  • @JaredFrazier: I described the defect in the book’s code incorrectly. I have updated the answer for that. And I added a macro definition showing how the array indexing could be done. – Eric Postpischil Dec 26 '22 at 13:20