How to break an array into blocks

Question

I have an array that represents points in a cuboid. It is a one dimensional array, which uses the following indexing function to realise the 3 dimensions:

int getCellIndex(int ix, int iy, int iz) {
    return ix + (iy * numCellsX) + (iz * numCellsX * numCellsY);
}

The number of cells in the domain is:

numCells = (numX + 2) * (numY + 2) * (numZ + 2)

Where numX/numY/numZ are the number of cells in the X/Y/Z direction. The +2 in each direction is to create padding cells around the outside of the domain. The number of cells in each direction is give by:

numX = 5 * numY
numZ = numY/2
numY = userInput

For each cell, I want to calculate a new value for that cell based upon it's neighbours value (i.e. a stencil), where it's neighbours are above, below, left, right, front and back. However, I only want to do this calculation for cells that aren't bad. I have a boolean array that tracks if a cell is bad. This is what the computation currently looks like:

for(int z = 1; z < numZ+1; z++) {
    for(int y = 1; y < numY+1; y++) {
        for(int x = 1; x < numX+1; x++) {
            if(!isBadCell[ getCellIndex(x,y,z) ] {
                // Do stencil Computation
            }
        }
    }
}

This is not great performance wise. I want to be able to vectorize the loop to improve performance, however I can't because of the if statement. I know if cells are bad in advance and this does not change throughout the computation. I'd like to split the domain up into blocks, preferably 4x4x4 blocks, so that I can calculate a priori per block if it contains bad cells, and if so process it as usual, or if not, use an optimized function that can take advantage of vectorization e.g.

for(block : blocks) {
    if(isBadBlock[block]) {
        slowProcessBlock(block) // As above
    } else {
        fastVectorizedProcessBlock(block)
    }
}

NOTE: It is not required for the blocks to physically exist i.e. this could be achieved by changing the indexing function, and using different indexes to loop over the array. I'm open to whatever works best.

The fastVectorizedProcessBlock() function would look similar to the slowProcessBlock() function, but with the if statement remove (since we know it doesn't contain bad cells), and a vectorization pragma.

How can I split up my domain into blocks so that I can accomplish this? It seems tricky because a) the number of cells in each direction are not equal, b) we need to take into account the padding cells, as we must never attempt to calculate their value, as this would lead to a memory access that is out of bounds.

How can I then process the blocks that don't contain bad cells without using an if statement?

EDIT:

This is the idea I originally had:

for(int i = 0; i < numBlocks; i++) { // use blocks of 4x4x4 = 64
    if(!isBadBlock[i]) {
        // vectorization pragma here
        for(int z = 0; z < 4; z++) {
            for(int y = 0; y < 4; y++) {
                for(int x = 0; x < 4; x++) {
                    // calculate stencil using getCellIndex(x,y,z)*i
                }
             }
         }
     } else {
         for(int z = 0; z < 4; z++) {
            for(int y = 0; y < 4; y++) {
                for(int x = 0; x < 4; x++) {
                    if(!isBadCell[i*getCellIndex(x,y,z)]) {    
                    // calculate stencil using getCellIndex(x,y,z)*i
                }
             }
         }
     }
 }

Cells would now be stored in blocks, i.e. all the cells in the first 4x4x4 block would be stored in pos 0-63, then all cells in the second block would be stored in pos 64-127 etc.

However, I don't think will work if the numX/numY/numZ values are not kind. For example, what if numY = 2, numZ = 1 and numX = 10? The for loops would expect the z direction to be at least 4 cells deep. Is there a good way to get past this?

UPDATE 2 - Here is what the stencil computation looks like:

if ( isBadCell[ getCellIndex(x,y,z) ] ) {
  double temp = someOtherArray[ getCellIndex(x,y,z) ] +
                    1.0/CONSTANT/CONSTANT*
                    (
                      - 1.0 * cells[ getCellIndex(x-1,y,z) ]
                      - 1.0 * cells[ getCellIndex(x+1,y,z) ]
                      - 1.0 * cells[ getCellIndex(x,y-1,z) ]
                      - 1.0 * cells[ getCellIndex(x,y+1,z) ]
                      - 1.0 * cells[ getCellIndex(x,y,z-1) ]
                      - 1.0 * cells[ getCellIndex(x,y,z+1) ]
                      + 6.0 * cells[ getCellIndex(x,y,z) ]
                      );
  globalTemp += temp * temp;
  cells[ getCellIndex(x,y,z) ] += -omega * temp / 6.0 * CONSTANT * CONSTANT;
}

Side tip: rather than `ix + (iy * numCellsX) + (iz * numCellsX * numCellsY)`, use a chained computation `ix + numCellsX*(iy + iz * numCellsY)`, one less `*`. — chux - Reinstate Monica, Jan 27 '17 at 03:21
If all blocks aren't 4x4x4, then how do you determine which ones that aren't? Do you allocate an array with the exact size `X*Y*Z` or do you round this so that you end up with exactly `(X*Y*Z) / (4*4*4)` cubes? I think you need to clear this out before the question can be answered. — Lundin, Jan 27 '17 at 14:40
@Lundin Blocks that aren't 4x4x4 could be padded to make them 4x4x4, however then I'd need to know which cells are padded cells when I loop through each block, which would require an if statement stopping vectorization. Any ideas how to get around this? — JC2188, Jan 29 '17 at 04:17
I am assuming that these are set up something like `int cells[numCellsZ][numCellsY][numCellsX]`... in which case case can't you just take the offset from cells[0][0][0] by doing something like `(int *) &cells[z][y][x] - (int *) &cells[0][0][0]` — technosaurus, Jan 29 '17 at 04:51
@technosaurus the cells are declared cells[numCells] where numCells = (numX+2) + (numY+2) + (numZ+2). See the indexing function for the layout. Can you explain your solution a little bit more? I think it would only give me the global index of a cell, which is what the get index function does. — JC2188, Jan 29 '17 at 13:11
An if-statement by itself does not preclude vectorization, there just won't be a per-lane branch so some lanes will do useless work, and you'd have to able to ignore those lanes (in this case I imagine it's OK to write back an unchanged value). If you post the stencil computation we'd have more to work with there, the problem does not depend solely on the "if". — harold, Jan 29 '17 at 13:20
You say you use 7 cells centered around each cell to compute the stencil value. Do you make sure the border cells have an appropriate value for the computation to give correct values for the edge cells? Also do bad cells participate in the computation when the center cell is no bad? — chqrlie, Jan 29 '17 at 13:24
@chqrlie There is a layer of padding around the domain such that the outer layer of non-padding cells will use the value of the padded cells in their stencil computation. The padded cells have their values normalized separately after each run through the domain. Essentially, bad cells are clustered together, into squares/cuboids for example. For good cells that have a boundary with bad cells, the value of the bad cell is used, however this means that only the outer layer of bad cells are used. The values of the outer layer of bad cells are updated after each run through the domain. — JC2188, Jan 29 '17 at 13:56
@harold Please see update 2 for the stencil computation. I should say that it is required that these cells are broken up into blocks, as the blocks are going to be used somewhere else. — JC2188, Jan 29 '17 at 14:08
1) make the address-calculation function `static` to allow inlining. 2) use unsigned indexes, preferrably `size_t` 3) avoid conditional jumps inside loops (except maybe for `break` and `continue` ) — wildplasser, Jan 29 '17 at 15:01
Blocking will not help much. You can solve almost all of your problems by switching from a boolean map of bad cells to a list of bad cell indexes. See my suggestions below for details. (It's not really an answer, because you list *blocking* as a prerequisite. Which, in my experience, does not help here.) — Nominal Animal, Jan 30 '17 at 14:23

chqrlie · Accepted Answer · 2017-01-29T22:54:03.603

Where does getCellIndex() retrieve the values of numCellX and numCellY? It would be better to pass them as arguments instead of relying on global variables, and to make this function static inline to allow the compiler to optimize.

static line int getCellIndex(int ix, int iy, int iz, int numCellsX, numCellsY) {
    return ix + (iy * numCellsX) + (iz * numCellsX * numCellsY);
}

for (int z = 1; z <= numZ; z++) {
    for (int y = 1; y <= numY; y++) {
        for (int x = 1; x <= numX; x++) {
            if (!isBadCell[getCellIndex(x, y, z, numX + 2, numY + 2)] {
                // Do stencil Computation
            }
        }
    }
}

You could also remove all multiplications with some local variables:

int index = (numY + 2) * (numX + 2);  // skip top padding plane
for (int z = 1; z <= numZ; z++) {
    index += numX + 2;  // skip first padding row
    for (int y = 1; y <= numY; y++) {
        index += 1;   // skip first padding col
        for (int x = 1; x <= numX; x++, index++) {
            if (!isBadCell[index] {
                // Do stencil Computation
            }
        }
        index += 1;   // skip last padding col
    }
    index += numX + 2;   // skip last padding row
}

Whether these directions are promissing or not depends a lot on the actual computations performed to get the stencil value. You should post that too.

If you can change the format of the boolean array for bad cells, it would be useful to pad the lines to a multiple of 8 and to use horizontal padding of 8 columns to improve alignment. Making the boolean array an array of bits allows to check for 8, 16, 32 or even 64 cells at a time with a single test.

You can adjust the array pointer to use 0 based coordinates.

Here is how it would work:

int numCellsX = 8 + ((numX + 7) & ~7) + 8;
int numCellsY = 1 + numY + 1;
int numCellsXY = numCellsX * numCellsY;
// adjusted array_pointer
array_pointer = allocated_pointer + 8 + numCellsX + numCellsXY;
// assuming the isBadCell array is 0 based too.
for (int z = 0, indexZ = 0; z < numZ; z++, indexZ += numCellsXY) {
    for (int y = 0, indexY = indexZ; y < numY; y++, indexY += numCellsX) {
        for (int x = 0, index = indexY; x <= numX - 8; x += 8, index += 8) {
            int mask = isBadCell[index >> 3];
            if (mask == 0) {
                // let the compiler unroll computation for 8 pixels with
                for (int i = 0; i < 8; i++) {
                   // compute stencil value for x+i,y,z at index+i
                }
            } else {
                for (int i = 0; i < 8; i++, mask >>= 1) {
                    if (!(mask & 1)) {
                       // compute stencil value for x+i,y,z at index+i
                    }
                }
            }
        }
        int mask = isBadCell[index >> 3];
        for (; x < numX; x++, index++, mask >>= 1) {
            if (!(mask & 1)) {
                // compute stencil value for x,y,z at index
            }
        }
    }
}

EDIT:

The stencil function uses too many calls to getCellIndex. Here is how to optimize it using the index value computed in the above code:

// index is the offset of cell x,y,z
// numCellsX, numCellsY are the dimensions of the plane
// numCellsXY is the offset between planes: numCellsX * numCellsY

if (isBadCell[index]) {
    double temp = someOtherArray[index] +
                1.0 / CONSTANT / CONSTANT *
                ( - 1.0 * cells[index - 1]
                  - 1.0 * cells[index + 1]
                  - 1.0 * cells[index - numCellsX]
                  - 1.0 * cells[index + numCellsX]
                  - 1.0 * cells[index - numCellsXY]
                  - 1.0 * cells[index + numCellsXY]
                  + 6.0 * cells[index]
                );
    cells[index] += -omega * temp / 6.0 * CONSTANT * CONSTANT;
    globalTemp += temp * temp;
}

precomputing &cells[index] as a pointer might improve the code, but the compile should be able to detect this common subexpression and generate efficient code already.

EDIT2:

Here is a tiled approach: you can add the missing arguments, most sizes are assumed to be global but you should probably pass a pointer to a context structure with all these values. It uses isBadTile[] and isGoodTile[]: arrays of boolean telling if a given tile has all cells bad and all cells good respectively.

void handle_tile(int x, int y, int z, int nx, int ny, int nz) {
    int index0 = x + y * numCellsX + z * numCellsXY;
    // skipping a tile with all cells bad.
    if (isBadTile[index0] && nx == 4 && ny == 4 && nz == 4)
        return;
    // handling a 4x4x4 tile with all cells OK.
    if (isGoodTile[index0] && nx == 4 && ny == 4 && nz == 4) {
        for (int iz = 0; iz < 4; iz++) {
            for (int iy = 0; iy < 4; iy++) {
                for (int ix = 0; ix < 4; ix++) {
                    int index = index0 + ix + iy * numCellsX + iz + numCellsXY;
                    // Do stencil computation using `index`
                }
            }
        }
    } else {
        for (int iz = 0; iz < nz; iz++) {
            for (int iy = 0; iy < ny; iy++) {
                for (int ix = 0; ix < nx; ix++) {
                    int index = index0 + ix + iy * numCellsX + iz + numCellsXY;
                    if (!isBadCell[index] {
                        // Do stencil computation using `index`
                }
            }
        }
    }
}

void handle_cells() {
    int x, y, z;
    for (z = 1; z <= numZ; z += 4) {
        int nz = min(numZ + 1 - z, 4);
        for (y = 1; y <= numY; y += 4) {
            int ny = min(numY + 1 - y, 4);
            for (x = 1; x <= numX; x += 4) {
                int nx = min(numX + 1 - x, 4);
                handle_tile(x, y, z, nx, ny, nz);
            }
        }
    }
}

Here is a function to compute the isGoodTile[] array. The only offsets correctly computed correspond to values of x multiples of 4 + 1, y and z less than 3 from their maximum values.

This implementation is sub-optimal as fewer elements could be computed. Incomplete border tiles (less than 4 from the edge) could be flagged as not good to skip the good case with a single case. The test for bad tiles could work for these edge tiles if the isBadTile array was properly computed for the edge tiles, which is currently not the case.

void computeGoodTiles() {
    int start = 1 + numCellsX + numCellsXY;
    int stop = numCellsXY * numCellsZ - 1 - numCellsX - numCellsXY;

    memset(isGoodTile, 0, sizeof(*isGoodTile) * numCellsXY * numCellsZ);
    for (int i = start; i < stop; i += 4) {
        isGoodTile[i] = (isBadCell[i + 0] | isBadCell[i + 1] |
                         isBadCell[i + 2] | isBadCell[i + 3]) ^ 1;
    }
    for (int i = start; i < stop - 3 * numCellsX; i += 4) {
        isGoodTile[i] = isGoodTile[i + 0 * numCellsX] &
                        isGoodTile[i + 1 * numCellsX] &
                        isGoodTile[i + 2 * numCellsX] &
                        isGoodTile[i + 3 * numCellsX];
    }
    for (int i = start; i < stop - 3 * numCellsXY; i += 4) {
        isGoodTile[i] = isGoodTile[i + 0 * numCellsXY] &
                        isGoodTile[i + 1 * numCellsXY] &
                        isGoodTile[i + 2 * numCellsXY] &
                        isGoodTile[i + 3 * numCellsXY];
    }
}

void computeBadTiles() {
    int start = 1 + numCellsX + numCellsXY;
    int stop = numCellsXY * numCellsZ - 1 - numCellsX - numCellsXY;

    memset(isBadTile, 0, sizeof(*isBadTile) * numCellsXY * numCellsZ);
    for (int i = start; i < stop; i += 4) {
        isBadTile[i] = isBadCell[i + 0] & isBadCell[i + 1] &
                       isBadCell[i + 2] & isBadCell[i + 3];
    }
    for (int i = start; i < stop - 3 * numCellsX; i += 4) {
        isBadTile[i] = isBadTile[i + 0 * numCellsX] &
                       isBadTile[i + 1 * numCellsX] &
                       isBadTile[i + 2 * numCellsX] &
                       isBadTile[i + 3 * numCellsX];
    }
    for (int i = start; i < stop - 3 * numCellsXY; i += 4) {
        isBadTile[i] = isBadTile[i + 0 * numCellsXY] &
                       isBadTile[i + 1 * numCellsXY] &
                       isBadTile[i + 2 * numCellsXY] &
                       isBadTile[i + 3 * numCellsXY];
    }
}

Thanks for the answer, I'm going to go through it in a bit more detai, but I just wanted to ask how this would be extended so that we can skip whole blocks of cells? The main reason for the blocks optimisation is that we can calculate before the whole computation if a certain region/block contains badCells, and if not, we can use an optimised function to process it, and in other computations not shown here, we can actually skip these blocks altogether. Considering that the badCells take up a small fraction of the domain, this should make the code much faster. Does this make sense? — JC2188, Jan 29 '17 at 14:13
@JC2188: I updated the code for horizontal blocks of 8 pixels. You could easily extend this for 16, 32 or 64 pixels, but it is row oriented. If the bad blocks are clusters with smaller X dimensions but extend in the other dimensions, a more complicated approach could give good results, but I suspect you might need a different structure for the `isBadCell` array. — chqrlie, Jan 29 '17 at 14:21
@JC2188: Optimizing the inner loop should provide for substantial improvements. Try it first without the bit oriented stuff. — chqrlie, Jan 29 '17 at 14:32
Ok, I think I understand how this works. Would it be possible to do this so that rather than it being row-oriented, it is tiled? I think this would give better cache behaviour — JC2188, Jan 29 '17 at 21:11
@JC2188: it may give better cache behavior or it may not, only careful benchmarking will tell you which is more performant, and only for the architecture, sizes and compiler tested. A tiled approach can be attempted by adding more loops, but the incremental computation of the index computation would become more complicated and costly, probably not worthwhile. I shall post such an alternative. — chqrlie, Jan 29 '17 at 21:19
@JC2188: I updated the answer with a tile version. If you can compute an `isBadTile` array to use in `handle_tile`, you can remove the test at the pixel level and see if it further improves the performance. Computing this array is a bit tricky but you only need to compute values for `x`, `y`, `z` values equal to 1 modulo 4. — chqrlie, Jan 29 '17 at 22:07
Thanks for the update. I've tried to run the earlier version of EDIT2, however I think something is wrong with it. After each iteration, I'm expecting a value to be ~ 3x10^-6 and for it to stay roughly of that order throughout the computation, but for some reason, I'm a value of 12 on the first run, then this grows and grows throughout the computation. This might be something to do with how I've converted your example though. Shall I post the code here or can I message you with it? It's a lot of code to put on here, and I don't think it will be relevant once debugged. — JC2188, Jan 29 '17 at 23:30
I should also mention that the computation of good and bad cells/tiles only happens right at the beginning of the program, so there is no real need to optimise these. — JC2188, Jan 29 '17 at 23:36
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/134328/discussion-between-jc2188-and-chqrlie). — JC2188, Jan 29 '17 at 23:37

score 4 · Answer 2 · answered Jan 30 '17 at 14:18

Although OP requires an approach using blocking, I would suggest against it.

You see, each consecutive sequence of cells (the 1D cells along the X axis) is already such a block. Rather than make the problem simpler, blocking just replaces the original problem with smaller copies of fixed size, repeated over and over.

Simply put, blocking does not help at all with the real problem at hand. It should not be a requisite feature of a solution at all.

Instead, I would suggest avoiding the root problem altogether -- just in a different fashion.

You see, rather than have a "bad cell" flag for each cell that you need to test (once for each cell, no less), you can keep a (sorted) list of the bad cells' indexes. Then you can process the entire data set at once, followed by a fix-up loop over the cells listed in the bad cells' index list.

Also note that unless you work on a copy of the cell values, the order in which you compute the new cell values will affect the result. This is almost certainly not what you want.

So, here's my suggestion:

#include <stdlib.h>
#include <errno.h>

typedef struct {
    /* Core cells in the state, excludes border cells */
    size_t   xsize;
    size_t   ysize;
    size_t   zsize;

    /* Index calculation: x + y * ystride + z * zstride */
    /* x is always linear in memory; xstride = 1 */
    size_t   ystride; /* = xsize + 2 */
    size_t   zstride; /* = ystride * (ysize + 2) */

    /* Cell data, points to cell (0,0,0) */
    double  *current;
    double  *previous;

    /* Bad cells */
    size_t   fixup_cells;  /* Number of bad cells */
    size_t  *fixup_index;  /* Array of bad cells' indexes */

    /* Dynamically allocated memory */
    void    *mem[3];
} lattice;

void lattice_free(lattice *const ref)
{
    if (ref) {
        /* Free dynamically allocated memory, */
        free(ref->mem[0]);
        free(ref->mem[1]);
        free(ref->mem[2]);
        /* then initialize/poison the contents. */
        ref->xsize = 0;
        ref->ysize = 0;
        ref->zsize = 0;
        ref->ystride = 0;
        ref->zstride = 0;
        ref->previous = NULL;
        ref->current = NULL;
        ref->fixup_cells = 0;
        ref->fixup_index = NULL;
        ref->mem[0] = NULL;
        ref->mem[1] = NULL;
        ref->mem[2] = NULL;
    }
}


int lattice_init(lattice *const ref, const size_t xsize, const size_t ysize, const size_t zsize)
{
    const size_t  xtotal = xsize + 2;
    const size_t  ytotal = ysize + 2;
    const size_t  ztotal = zsize + 2;
    const size_t  ntotal = xtotal * ytotal * ztotal;
    const size_t  double_bytes = ntotal * sizeof (double);
    const size_t  size_bytes = xsize * ysize * zsize * sizeof (size_t);

    /* NULL reference to the variable to initialize? */
    if (!ref)
        return EINVAL;

    /* Initialize/poison the lattice variable. */
    ref->xsize = 0;
    ref->ysize = 0;
    ref->zsize = 0;
    ref->ystride = 0;
    ref->zstride = 0;
    ref->previous = NULL;
    ref->current = NULL;
    ref->fixup_cells = 0;
    ref->fixup_index = NULL;
    ref->mem[0] = NULL;
    ref->mem[1] = NULL;
    ref->mem[2] = NULL;

    /* Verify size is nonzero */
    if (xsize < 1 || ysize < 1 || zsize < 1)
        return EINVAL;        

    /* Verify size is not too large */
    if (xtotal <= xsize || ytotal <= ysize || ztotal <= zsize ||
        ntotal / xtotal / ytotal != ztotal ||
        ntotal / xtotal / ztotal != ytotal ||
        ntotal / ytotal / ztotal != xtotal ||
        double_bytes / ntotal != sizeof (double) ||
        size_bytes / ntotal != sizeof (size_t))
        return ENOMEM;

    /* Allocate the dynamic memory needed. */
    ref->mem[0] = malloc(double_bytes);
    ref->mem[1] = malloc(double_bytes);
    ref->mem[2] = malloc(size_bytes);
    if (!ref->mem[0] || !ref->mem[1] || !ref->mem[2]) {
        free(ref->mem[2]);
        ref->mem[2] = NULL;
        free(ref->mem[1]);
        ref->mem[1] = NULL;
        free(ref->mem[0]);
        ref->mem[0] = NULL;
        return ENOMEM;
    }

    ref->xsize = xsize;
    ref->ysize = ysize;
    ref->zsize = zsize;

    ref->ystride = xtotal;
    ref->zstride = xtotal * ytotal;

    ref->current = (double *)ref->mem[0] + 1 + xtotal;
    ref->previous = (double *)ref->mem[1] + 1 + xtotal;

    ref->fixup_cells = 0;
    ref->fixup_index = (size_t *)ref->mem[2];

    return 0;
}

Note that I prefer the x + ystride * y + zstride * z index calculation form over x + xtotal * (y + ytotal * z), because the two multiplications in the former can be done in parallel (in a superscalar pipeline, on architectures which can do two unrelated integer multiplications at the same time on a single CPU core), whereas in the latter, the multiplications have to be sequential.

Note that ref->current[-1 - ystride - zstride] refers to the current cell value at cell (-1, -1, -1), i.e. the border cell diagonal from the origin cell (0, 0, 0). In other words, if you have a cell (x, y, z) at index i, then
i-1 is the cell at (x-1, y, z)
i+1 is the cell at (x+1, y, z)
i-ystride is the cell at (x, y-1, z)
i+ystride is the cell at (x, y+1, z)
i-zstride is the cell at (x, y, z-1)
i+zstride is the cell at (x, y, z-1)
i-ystride is the cell at (x, y-1, z)
i-1-ystride-zstride is the cell at (x-1, y-1, z-1)
i+1+ystride+zstride is the cell at (x+1, y+1, z+1)
and so on.

The ref->fixup_index array is large enough to list all cells, except for border cells. It is a good idea to keep it sorted (or sort it after building it), because that helps with cache locality.

If your lattice has periodic boundary conditions, you can use six 2D loops, twelve 1D loops, and eight copies to copy the first and last valid cells to the border, before starting a new update.

Your update cycle is therefore essentially:

Compute or fill borders in ->current.
Swap ->current and ->previous.
Compute all cells for ->current using data from ->previous.
Loop over the ->fixup_cells indexes in ->fixup_index, and recalculate corresponding ->current cells.

Note that in step 3, you can do it linearly for all indexes between 0 and xsize-1 + (ysize-1)*ystride + (zsize-1)*zstride, inclusive; that is, including about 67% of the border cells. They are relatively few compared to the entire volume, and having a single linear loop is likely faster than skipping over the border cells -- especially if you can vectorize the computation. (Which, in this case, is nontrivial.)

You can even split the work for multiple threads, by giving each thread a continous set of indexes to work on. Because you read from ->previous and write to ->current, the threads won't trample over each other, although there may be some cacheline ping-pong if a thread reaches the end of its region while another is at the start of its region; because of how the data is oriented (and cache lines are just a few -- typically 2, 4, or 8 -- cells in size), that ping-pong should not be an issue in practice. (Obviously, no locks are needed.)

This particular problem is not really new in any way. Modeling Conway's Game of Life or square- or cubic-lattice Ising model, as well as implementing many other lattice models, involve the same problem (but often with Boolean data rather than doubles, and without "bad cells").

score 2 · Answer 3 · answered Jan 29 '17 at 14:50

I think you may nest a couple of similar set of loops. Something like this:

for(int z = 1; z < numZ+1; z+=4) {
    for(int y = 1; y < numY+1; y+=4) {
        for(int x = 1; x < numX+1; x+=4) {
            if(!isBadBlock[ getBlockIndex(x>>2,y>>2,z>>2) ]) {
                for(int zz = z; zz < z + 4 && zz < numZ+1; zz++) {
                   for(int yy = y; yy < y + 4 && yy < numY+1; yy++) {
                      for(int xx = z; xx < x + 4 && xx < numX+1; xx++) {
                         if(!isBadCell[ getCellIndex(xx,yy,zz) ]) {
                             // Do stencil Computation
                            }
                        }
                    }
                }
            }
        }
    }
}

Inn the above code, `isBadBlock` is an array of booleans indicating the blocks will all cells bad. The OP seems to assume there are more blocks with all cells good and wants to optimize these first. I guess both approaches could be combined depending on the distribution of bad cells. — chqrlie, Jan 29 '17 at 22:39

technosaurus · Answer 4 · 2017-01-30T01:04:29.303

The way you currently have it set up, you can simply get the index by using a 3d array as follows:

#include <sys/types.h>
#define numX 256
#define numY 128
#define numZ 64
//Note the use of powers of 2 - it will simplify things a lot

int cells[numX][numY][numZ];

size_t getindex(size_t x, size_t y,size_t z){
  return (int*)&cells[x][y][z]-(int*)&cells[0][0][0];
}

This will lay out the cells like:

[0,0,0][0,0,1][0,0,2]...[0,0,numZ-1]
[0,1,0][0,1,1][0,1,2]...[0,1,numZ-1]
...
[0,numY-1,0][0,numY-1,1]...[0,1,numZ-1]
...
[1,0,0][1,0,1][0,0,2]...[1,0,numZ-1]
[1,1,0][1,1,1][1,1,2]...[1,1,numZ-1]
...
[numX-1,numY-1,0][numX-1,numY-1,1]...[numX-1,numY-1,numZ-1]

So efficient loops would look like:

for(size_t x=0;x<numX;x++)
  for(size_t y=0;y<numY;y++)
    for(size_t z=0;z<numZ;z++)
      //vector operations on z values

But, if you want to split it into 4x4x4 blocks, you could just use a 3d array of 4x4x4 blocks something like:

#include <sys/types.h>
#define numX 256 
#define numY 128
#define numZ 64

typedef int block[4][4][4];
block blocks[numX][numY][numZ];
//add a compiler specific 64 byte alignment to  help with cache misses?

size_t getblockindex(size_t x, size_t y,size_t z){
  return (block *)&blocks[x][y][z]-(block *)&blocks[0][0][0];
}

I reordered the indices to x,y,z just so I could keep them straight in my head, but make sure that you order them so that the last one is the one you operate on a series of in your innermost for loops.

How to break an array into blocks

4 Answers4