I have the following CUDA kernel:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(data, investment, profitability);
    }
}

TL;DR I would like to find a way to store data in shared (__shared__) memory. What I don't understand is how to fill the shared variable using multiple threads.

I have seen examples like this one where data is copied to shared memory thread by thread (e.g. myblock[tid] = data[tid]), but I'm not sure how to do this in my situation. The issue is that each thread needs access to an entire "row" (flattened) of data with each iteration through the data set (see further below where the kernel is called).

I'm hoping for something like this:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
    __shared__ double sharedData[propertyCount];

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}
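
As I understand it, that declaration won't compile as written: a statically declared `__shared__` array needs a compile-time constant size, and `propertyCount` is a runtime argument. The dynamically sized alternative uses `extern __shared__`, with the size supplied at launch. A minimal sketch of that variant (how to actually fill sharedData cooperatively is still my open question):

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, int propertyCount, double investment, double profitability) {
    // Dynamically sized shared array; its byte size comes from the third
    // kernel launch configuration argument, not from this declaration.
    extern __shared__ double sharedData[];

    // TODO: cooperatively copy one row of data into sharedData here.

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}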

Here are more details (if more information is needed, please ask!):

strategies is a pointer to a list of Strategy objects, and data is a pointer to an allocated flattened data array.

In backtest() I access data like so:

data[0]
data[1]
data[2]
...

Unflattened, data is a fixed-size 2D array similar to this:

[[87.6, 85.4, 88.2, 86.1],
 [84.1, 86.5, 86.7, 85.9],
 [86.7, 86.5, 86.2, 86.1],
 ...]

As for the kernel call, I iterate over the data items and call it n times for n data rows (about 3.5 million):

int dataCount = 3500000;
int propertyCount = 4;

for (int i = 0; i < dataCount; i++) {
    unsigned int dataPointerOffset = i * propertyCount;

    // Notice pointer arithmetic.
    optimizer_backtest<<<32, 1024>>>(devData + dataPointerOffset, devStrategies, strategyCount, investment, profitability);
}
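
With the `extern __shared__` variant sketched above, the launch would also pass the shared array's byte size as the third launch configuration argument, something like:

// Third launch argument: dynamic shared memory size in bytes.
optimizer_backtest<<<32, 1024, propertyCount * sizeof(double)>>>(devData + dataPointerOffset, devStrategies, strategyCount, propertyCount, investment, profitability);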
  • Why is `sharedData` an array of pointers? It isn't at all clear what you are trying to do here – talonmies Jun 07 '16 at 20:11
  • It's not an array of pointers. It's what amounts to a pointer to an array in memory. As I have already explained, it is a flat structure like this: `[87.6, 85.4, 88.2, 86.1, 84.1, 86.5, 86.7, 85.9, 86.7, 86.5, 86.2, 86.1, ...]`. `data[1]` is `85.4`. And that's exactly how `Strategy::backtest()` uses `data`: it accesses items by index. `data` is initialized as follows: `double *data = (double*)malloc(dataPointCount * dataPropertyCount * sizeof(double));` I'm not sure what else there is to explain. – Chad Johnson Jun 07 '16 at 20:14
  • Oh, I see what you are saying. `sharedData` should be `__shared__ double sharedData[propertyCount]`. I will update this. Thank you. Please don't vote to close my question just because of one typo. – Chad Johnson Jun 07 '16 at 20:15
  • Are you really saying you can't write code to load 4 double values into a shared memory array? What *exactly* is your problem in doing that? I'm not voting to close this because of typos, I am voting to close it because it is a vague, broad post with lots of words and no real concrete question. – talonmies Jun 07 '16 at 20:35
  • No. What I don't (didn't, now) understand is how to fill the shared variable using multiple threads (I've updated the question to include this). But now I understand, thanks to someone on Reddit, that "`__syncthreads()` is a block-wide barrier, which means that it makes sure that all threads in a given block reached this point before it continues." I will post working code here as an answer for others' reference once I have it working. Here is [the discussion thread](https://www.reddit.com/r/CUDA/comments/4n04cz/how_can_i_use_shared_memory_here_in_my_cuda_kernel/d3zpv9c). – Chad Johnson Jun 07 '16 at 20:37
  • So your actual question is about how to synchronize threads within a block after a subset of those threads have performed a load to shared memory? That's a great, concise, easy to answer question, but for the asking..... – talonmies Jun 07 '16 at 20:46
  • It's kind of unclear what you want to do with your code, and why you need the shared memory. It helps if you can explain this a little. Shared memory is usually used to exchange data across threads. So I guess you want to apply 20k strategies to every one of the 3.5m data rows and examine the 20k x 3.5m results? If yes, it sounds like a good situation to use shared memory. But you need much larger shared mem than 4 elements, and you probably want to cache the strategies with shared memory too. – kangshiyin Jun 08 '16 at 02:17
  • @Eric Exactly. The goal is to pass every data "row" (flattened) to backtest() for every strategy as you described. There are actually 838 elements, and the example was just simplified to 4. I may try caching the strategies too :) – Chad Johnson Jun 08 '16 at 02:22

2 Answers

For people in the future in search of a similar answer, here is what I ended up with for my kernel function:

__global__ void optimizer_backtest(double *data, Strategy *strategies, int strategyCount, double investment, double profitability) {
    __shared__ double sharedData[838];

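    // Cooperatively stage one data row in shared memory: each of the
    // first 838 threads copies one element. This assumes blockDim.x >= 838
    // (the launch in the question uses 1024 threads per block).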
    if (threadIdx.x < 838) {
        sharedData[threadIdx.x] = data[threadIdx.x];
    }

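    // Block-wide barrier: no thread continues until the full row is loaded.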
    __syncthreads();

    // Use a grid-stride loop.
    // Reference: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < strategyCount;
         i += blockDim.x * gridDim.x)
    {
        strategies[i].backtest(sharedData, investment, profitability);
    }
}

Note that I use both .cuh and .cu files in my application, and I put this in the .cu file. Also note that I use --device-c in my Makefile when compiling object files. I don't know if that's how things should be done, but that's what worked for me.
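
In case it helps, here is a minimal sketch of the build commands that --device-c implies (file names are placeholders for my actual sources):

# Compile each .cu file to an object with relocatable device code.
nvcc --device-c -o optimizer.o optimizer.cu
nvcc --device-c -o main.o main.cu

# nvcc handles the device link step when given the objects.
nvcc -o backtester optimizer.o main.o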

– Chad Johnson

As confirmed in your comment, you want to apply 20k strategies (this number is from your previous question) to every one of the 3.5m data rows and examine the 20k x 3.5m results.

Without shared memory, you have to read all of the data 20k times, or all of the strategies 3.5m times, from global memory.

Shared memory can speed up your program by reducing global memory access. Say you can read 1k strategies and 1k data rows into shared memory each time, examine the 1k x 1k results, and repeat until everything has been examined. That way you reduce global memory access to 20 reads of all the data and 3.5k reads of all the strategies. This situation is similar to a vector-vector cross product, and you can find reference code for more detail.

However, each of your data rows is large (an 838-element vector), and your strategies may be large too. You may not be able to cache many of them in shared memory (only ~48 KB per block, depending on the device). So the situation becomes something like matrix-matrix multiplication, and you may get some hints from the tiled matrix multiplication code at the following link.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory
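
A minimal sketch of that tiling idea, under heavy simplifications (each strategy reduced to a single threshold as a stand-in for the real Strategy objects, and a hypothetical evaluate() in place of Strategy::backtest(); the tile size is chosen to keep the shared buffer well under 48 KB):

#define PROP 838      // elements per data row (from the question)
#define TILE_ROWS 4   // rows cached per block: 4 * 838 * 8 bytes ~ 26 KB

// Hypothetical scoring function standing in for Strategy::backtest():
// counts how many points in a row exceed the strategy's threshold.
__device__ double evaluate(const double *row, double threshold) {
    int hits = 0;
    for (int k = 0; k < PROP; k++)
        hits += (row[k] > threshold);
    return (double)hits;
}

__global__ void backtest_tiled(const double *data, int rowCount,
                               const double *thresholds, int strategyCount,
                               double *results /* rowCount x strategyCount */) {
    __shared__ double tile[TILE_ROWS * PROP];

    // Each block walks the data set in tiles of TILE_ROWS rows.
    for (int base = blockIdx.x * TILE_ROWS; base < rowCount;
         base += gridDim.x * TILE_ROWS) {
        int rows = min(TILE_ROWS, rowCount - base);

        // Cooperative load: threads copy the tile element by element.
        for (int j = threadIdx.x; j < rows * PROP; j += blockDim.x)
            tile[j] = data[(size_t)base * PROP + j];
        __syncthreads();

        // Each thread examines its strategies against every cached row,
        // so each row is read from global memory only once per block.
        for (int s = threadIdx.x; s < strategyCount; s += blockDim.x)
            for (int r = 0; r < rows; r++)
                results[(size_t)(base + r) * strategyCount + s] =
                    evaluate(&tile[r * PROP], thresholds[s]);
        __syncthreads(); // keep the tile intact until all threads finish
    }
}

This replaces the per-row kernel launches with a single launch, e.g. backtest_tiled<<<256, 256>>>(...), so the loop over data rows moves from the host into the kernel.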

– kangshiyin