I have an array that represents points in a cuboid. It is a one dimensional array, which uses the following indexing function to realise the 3 dimensions:
int getCellIndex(int ix, int iy, int iz) {
return ix + (iy * numCellsX) + (iz * numCellsX * numCellsY);
}
The number of cells in the domain is:
numCells = (numX + 2) * (numY + 2) * (numZ + 2)
Where numX/numY/numZ are the number of cells in the X/Y/Z direction. The +2 in each direction is to create padding cells around the outside of the domain. The number of cells in each direction is give by:
numX = 5 * numY
numZ = numY/2
numY = userInput
For each cell, I want to calculate a new value for that cell based upon it's neighbours value (i.e. a stencil), where it's neighbours are above, below, left, right, front and back. However, I only want to do this calculation for cells that aren't bad. I have a boolean array that tracks if a cell is bad. This is what the computation currently looks like:
for(int z = 1; z < numZ+1; z++) {
for(int y = 1; y < numY+1; y++) {
for(int x = 1; x < numX+1; x++) {
if(!isBadCell[ getCellIndex(x,y,z) ] {
// Do stencil Computation
}
}
}
}
This is not great performance wise. I want to be able to vectorize the loop to improve performance, however I can't because of the if statement. I know if cells are bad in advance and this does not change throughout the computation. I'd like to split the domain up into blocks, preferably 4x4x4 blocks, so that I can calculate a priori per block if it contains bad cells, and if so process it as usual, or if not, use an optimized function that can take advantage of vectorization e.g.
for(block : blocks) {
if(isBadBlock[block]) {
slowProcessBlock(block) // As above
} else {
fastVectorizedProcessBlock(block)
}
}
NOTE: It is not required for the blocks to physically exist i.e. this could be achieved by changing the indexing function, and using different indexes to loop over the array. I'm open to whatever works best.
The fastVectorizedProcessBlock() function would look similar to the slowProcessBlock() function, but with the if statement remove (since we know it doesn't contain bad cells), and a vectorization pragma.
How can I split up my domain into blocks so that I can accomplish this? It seems tricky because a) the number of cells in each direction are not equal, b) we need to take into account the padding cells, as we must never attempt to calculate their value, as this would lead to a memory access that is out of bounds.
How can I then process the blocks that don't contain bad cells without using an if statement?
EDIT:
This is the idea I originally had:
for(int i = 0; i < numBlocks; i++) { // use blocks of 4x4x4 = 64
if(!isBadBlock[i]) {
// vectorization pragma here
for(int z = 0; z < 4; z++) {
for(int y = 0; y < 4; y++) {
for(int x = 0; x < 4; x++) {
// calculate stencil using getCellIndex(x,y,z)*i
}
}
}
} else {
for(int z = 0; z < 4; z++) {
for(int y = 0; y < 4; y++) {
for(int x = 0; x < 4; x++) {
if(!isBadCell[i*getCellIndex(x,y,z)]) {
// calculate stencil using getCellIndex(x,y,z)*i
}
}
}
}
}
Cells would now be stored in blocks, i.e. all the cells in the first 4x4x4 block would be stored in pos 0-63, then all cells in the second block would be stored in pos 64-127 etc.
However, I don't think will work if the numX/numY/numZ values are not kind. For example, what if numY = 2, numZ = 1 and numX = 10? The for loops would expect the z direction to be at least 4 cells deep. Is there a good way to get past this?
UPDATE 2 - Here is what the stencil computation looks like:
if ( isBadCell[ getCellIndex(x,y,z) ] ) {
double temp = someOtherArray[ getCellIndex(x,y,z) ] +
1.0/CONSTANT/CONSTANT*
(
- 1.0 * cells[ getCellIndex(x-1,y,z) ]
- 1.0 * cells[ getCellIndex(x+1,y,z) ]
- 1.0 * cells[ getCellIndex(x,y-1,z) ]
- 1.0 * cells[ getCellIndex(x,y+1,z) ]
- 1.0 * cells[ getCellIndex(x,y,z-1) ]
- 1.0 * cells[ getCellIndex(x,y,z+1) ]
+ 6.0 * cells[ getCellIndex(x,y,z) ]
);
globalTemp += temp * temp;
cells[ getCellIndex(x,y,z) ] += -omega * temp / 6.0 * CONSTANT * CONSTANT;
}