3

I have a 2d array SomeData* data[4096][4096]. Here the data is contiguous along the last coordinate, so that iterating over the y coordinate is faster than iterating over the x coordinate, due to memory locality. However, the access pattern I have is that I look at an entry, and then look at nearby neighbours in both coordinates, i.e. data[x][y] along with data[x-1][y-1], data[x+1][y+1], etc.

If I could allocate this array such that small 2d sub-blocks were contiguous in memory, that would speed things up.

I say allocate, but I suspect the solution here is to allocate a 2d array normally, and then do some trick with the indexing, such that I'm accessing 2d blocks contiguously. In other words, I want some lookup function that translates the coordinates, but I can't immediately see what it should be.

SomeData* data[4096][4096];

SomeData* lookup(size_t x, size_t y) {
    //??? Some logic to translate x,y such that 2d blocks are accessed contiguously.
}

The data array is guaranteed to have both dimensions be a power of two.

larspars
  • 1,640
  • 1
  • 15
  • 30
  • 1
    Allocate a 1D array and a couple of functions to map the 2D coordinates to 1D. – Richard Critten Nov 05 '19 at 10:43
  • You could create an ADT, and store all the data in a one-dimensional array (so: `SomeData * data[4096*4096]`), the access would then be `data[x + y * w]` where `w` is the width of a row. – Florian Humblot Nov 05 '19 at 10:43
  • Yup, any time the language's 2D indexing isn't exactly what you want, or sometimes even if you want to use runtime variable dimensions, you just roll your own indexing. It can be exactly equivalent, or not if you do it differently. – Peter Cordes Nov 05 '19 at 10:54
  • Check [this question](https://stackoverflow.com/questions/47120994/does-stdarray-of-stdarray-have-contiguous-memory) or [this question](https://stackoverflow.com/questions/9762662/is-the-data-in-nested-stdarrays-guaranteed-to-be-contiguous/9762712#9762712) which treat of this problem for `std::array`. – Clonk Nov 05 '19 at 10:55
  • Do you want to iterate over every element and check its neighbors? Or do you want to iterate 2d-subblock wise? Because for the first one is not possible to have every neighbor nearby. – phön Nov 05 '19 at 11:05
  • I made [3 examples](https://stackoverflow.com/a/58122229/7582247) for a similar question. 1. using a 1D vector for the data. 2. using an uninitialized dynamic array (faster to create). 3. Inheriting a vector (yes, I know) as a lookup table but using an uninitialized dynamic array for data. Seems to give faster access to the data. – Ted Lyngmo Nov 05 '19 at 11:17
  • I am aware that I can allocate a 1d array, and index like [x + y * w]. With this I can get memory to be contiguous along either coordinate. But that's not what I want. What I want is to index the array such that e.g. 16x16 blocks are contiguous. So data[0][0] up to data[15][15] is contiguous in memory. – larspars Nov 05 '19 at 11:36
  • @phön I want to check neighbours a few steps up/down/left/right/diagonal. Obviously I can't have everything be local to everything, but I suspect having e.g. 16x16 blocks be contiguous is achievable simply by manipulating indexes. – larspars Nov 05 '19 at 11:43
  • 2
    @larspars replace your placeholder type "SomeData" with a type representing your "block". So you have a 2d-array holding blocks. if your blocks are 16x16, divide your data dimensions each by 16. – phön Nov 05 '19 at 12:12
  • @phön That's a nice pragmatic approach, that would solve the problem. For reasons specific to my code base I'd prefer to do it purely with indexing into a single array, which seems possible. But what you suggest is a good approach. – larspars Nov 05 '19 at 12:49
  • You may want to take a look at space filling curves (https://en.wikipedia.org/wiki/Space-filling_curve). If you are just trying to optimize your code, it may be enough to split the traversal of the inner dimension: `for(int block = 0; block < yMax; block += 64) for(int x = 0; x < xMax; x++) for(int y = block; y < block + 64 && y < yMax; y++) { /* do stuff to data[x][y] */ }` With this approach you can easily ensure that three successive lines of data fit into your L1 cache, so accesses are fast and data needs only to be loaded/written to memory once. – cmaster - reinstate monica Nov 05 '19 at 14:20

1 Answers1

1

Lets say we have a nxm-grid. We want to subdivide the grid into bxb-blocks. It is necessary that n%b=0 and m%b=0.

Lets call the overall indices I=0,...,n-1 and J=0,...,m-1 and the indices in a block i=0,...,b-1 and j=0,...,b-1.

I have tried to sketch the layout here.

Given I, the column index of the block is (I/b) and the in-block-index i=I%b. The row-index of the block is (J/b) and the in-block-index j=J%b.

Each full block contins b*b elements. Therefore a full row of blocks contains (n/b)bb = n*b elements.

Putting all together the linear index of element (I,J) is:

  • (I%b) [the column in the block preceding the element]
  • +(J%b) * b [the rows in the block preceding the element]
  • +(I/b) * b*b [the column of blocks preceding the block of the element]
  • +(J/b) * n*b [the row of blocks preceding the block of the element]

A rough sketch of a runtime-sized blocked-grid-class:

template <typename T>
class Block_Grid
{
public:
  Block_Grid(std::size_t n, std::size_t m, std::size_t b)
  : _n(n), _m(m), _b(b), _elements(n * m)
  { 
    assert(n % b == 0);
    assert(m % b == 0);
  }

  T & operator()(std::size_t i, std::size_t j)
  {
    return _elements[index(i, j)];
  }

protected:
private:
  std::size_t index(int i, int j) const
  {
    return (i % b) 
           + (j % b) * b
           + (i / b) * b * b
           + (j / b) * n * b;
  }

  std::size_t _n;
  std::size_t _m;
  std::size_t _b;
  std::vector<T> _elements;
};

or a compile-time-sized class

template <typename T, std::size_t N, std::size_t M, std::size_t B>
class Block_Grid
{
  static_assert(N % B == 0);
  static_assert(M % B == 0);
public:
  Block_Grid() = default;

  T & operator()(std::size_t i, std::size_t j)
  {
    return _elements[index(i, j)];
  }

protected:
private:
  std::size_t index(std::size_t i, std::size_t j) const
  {
    return (i % B) + (j % B) * B + (i / B) * B * B + (j / B) * N * B;
  }

  std::array<T, N * M> _elements;
};
MadScientist
  • 3,390
  • 15
  • 19
  • With runtime-variable `b`, `index()` is horribly inefficient. The compiler won't even know it's power of 2 to allow it to use `j%b = j&(b-1)` ; you'll probably get an actual `div` instruction. On x86-64, that'll be a 64-bit `div` which is a few times slower than 32-bit `div`, but even that's a non-starter for performance. You need iterators or some kind of tile-aware array traversal. (gcc and/or clang might do something smart in a loop, like some kind of strength-reduction from division to a branch of `itmp < b` or something, but you should probably do that manually.) – Peter Cordes Nov 05 '19 at 14:13
  • You can templatize the class w.r.t. n, m, and b, of course. – MadScientist Nov 05 '19 at 14:16
  • Yes, and doing that for `b` is necessary for performance the way you wrote it. The point of the question is optimizing memory layout for performance, but division is so expensive it would eat up those gains. – Peter Cordes Nov 05 '19 at 14:32
  • Awesome answer. Exactly what I was looking for, and from a quick test the implementation looks correct. – larspars Nov 05 '19 at 14:46
  • @PeterCordes Maybe [explicitly](https://wandbox.org/permlink/k5X6q2SypKgHn18B) using a power of 2, would be better. – Bob__ Nov 05 '19 at 15:03
  • @Bob__: yes, in GNU C I'd consider using `if (b & (b - 1)) __builtin_unreachable()` at the top of these functions to promise the compiler a power of 2 (only 1 bit set). And validate that in the constructor. Or for the template version, maybe take it as a log2(b). Either way I'd check the asm of my loop to make sure it optimized nicely. – Peter Cordes Nov 05 '19 at 15:07