I have the following CUDA kernel:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(data, investment, profitability);
    }
}
TL;DR: I would like to find a way to store data in shared (__shared__) memory. What I don't understand is how to fill the shared variable using multiple threads.
I have seen examples like this one where data is copied to shared memory thread by thread (e.g. myblock[tid] = data[tid]), but I'm not sure how to do this in my situation. The issue is that each thread needs access to an entire "row" (flattened) of data on each iteration through the data set (see further below where the kernel is called).
I'm hoping for something like this:
__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
    __shared__ double sharedData[propertyCount];
    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}
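For illustration only, here is a minimal sketch of how the kernel above might fill a shared row cooperatively. It rests on several assumptions: since propertyCount is a run-time value, the array is declared as dynamically sized shared memory (extern __shared__) and its byte size is passed as the third launch parameter; the Strategy struct and its backtest() body are hypothetical stand-ins for the real class; and backtest_one_row is a hypothetical host helper added only to make the sketch runnable. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the real Strategy class (an assumption).
struct Strategy {
    double result;
    __device__ void backtest(const double *row, double investment, double profitability) {
        // Placeholder logic: reads the four values of the row.
        result = (row[0] + row[1] + row[2] + row[3]) * investment + profitability;
    }
};

__global__ void optimizer_backtest(const double *data, Strategy *strategies,
                                   int strategyCount, int propertyCount,
                                   double investment, double profitability) {
    // Dynamically sized shared memory; the size comes from the launch configuration.
    extern __shared__ double sharedData[];

    // Cooperative fill: each thread of the block copies a strided subset of the row.
    // With propertyCount = 4, only the first four threads of each block copy anything.
    for (int j = threadIdx.x; j < propertyCount; j += blockDim.x) {
        sharedData[j] = data[j];
    }
    // Every thread of the block reaches this barrier before any thread reads the row.
    __syncthreads();

    // Grid-stride loop over the strategies, now reading from shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}

// Hypothetical host helper: runs one data row against all strategies.
std::vector<double> backtest_one_row(const std::vector<double> &row, int strategyCount,
                                     double investment, double profitability) {
    int propertyCount = static_cast<int>(row.size());
    double *devRow = nullptr;
    Strategy *devStrategies = nullptr;
    cudaMalloc(&devRow, propertyCount * sizeof(double));
    cudaMalloc(&devStrategies, strategyCount * sizeof(Strategy));
    cudaMemcpy(devRow, row.data(), propertyCount * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(devStrategies, 0, strategyCount * sizeof(Strategy));

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t sharedBytes = propertyCount * sizeof(double);
    optimizer_backtest<<<32, 1024, sharedBytes>>>(devRow, devStrategies, strategyCount,
                                                  propertyCount, investment, profitability);
    cudaDeviceSynchronize();

    std::vector<Strategy> host(strategyCount);
    cudaMemcpy(host.data(), devStrategies, strategyCount * sizeof(Strategy), cudaMemcpyDeviceToHost);
    cudaFree(devRow);
    cudaFree(devStrategies);

    std::vector<double> results;
    for (const Strategy &s : host) results.push_back(s.result);
    return results;
}
```

Each block fills its own copy of the row, because shared memory is private to a block. Whether this helps in practice is not something this sketch establishes: with only four values per row, every thread reads the same addresses, and the caches may already serve those reads efficiently.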
Here are more details (if more information is needed, please ask!):
strategies is a pointer to a list of Strategy objects, and data is a pointer to an allocated flattened data array.
In backtest() I access data like so:
data[0]
data[1]
data[2]
...
Unflattened, data is a fixed size 2D array similar to this:
[[87.6, 85.4, 88.2, 86.1],
 [84.1, 86.5, 86.7, 85.9],
 [86.7, 86.5, 86.2, 86.1],
 ...]
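To make the flattening concrete, here is a small sketch of the index arithmetic, assuming row-major layout (which matches the i * propertyCount offset used in the kernel call). The helper names rowPointer and valueAt are hypothetical and not part of the original code.

```cuda
#include <cstddef>

// Hypothetical helper: pointer to the first value of a row in the flattened array.
__host__ __device__ inline const double *rowPointer(const double *data, size_t row,
                                                    size_t propertyCount) {
    return data + row * propertyCount;
}

// Hypothetical helper: value at (row, col) in the flattened, row-major array.
__host__ __device__ inline double valueAt(const double *data, size_t row, size_t col,
                                          size_t propertyCount) {
    return data[row * propertyCount + col];
}
```

With propertyCount = 4, row 1 of the array above starts at flattened index 4, so data[0] through data[3] inside backtest() refer to that row's four values when the kernel receives rowPointer(data, 1, 4).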
As for the kernel call, I iterate over the data items and call it n times for n data rows (about 3.5 million):
int dataCount = 3500000;
int propertyCount = 4;
for (int i = 0; i < dataCount; i++) {
    unsigned int dataPointerOffset = i * propertyCount;
    // Notice pointer arithmetic.
    optimizer_backtest<<<32, 1024>>>(devData + dataPointerOffset, devStrategies, strategyCount, investment, profitability);
}