
I am teaching myself some CUDA and I am trying to implement a simple softmax operation on a 2D tensor of size N x M along the second dimension. I have managed to write a naive solution that works when the column tile size equals the column dimension (i.e., when TILE_DIM_X = M). However, my function breaks when TILE_DIM_X < M. This makes sense, since the summation is carried out only over the tile and not over the full row.

Now, how can I address this issue to account for larger values of M? Furthermore, how can I carry out this reduction more efficiently? I've been reading up on parallel reductions and CUB, and I've been trying to implement a solution that makes use of cub::BlockReduce(), but no luck so far.

This is just a tiny part of my kernel, so I am trying to "modularize" operations as much as possible to keep this somewhat readable. As such, I am not interested in already-existing fused kernels for softmax.

#include <cuda_runtime.h>
#include <iostream>
#include <random>
#include <vector>

const int TILE_DIM_Y = 32;  // Tile dimension for rows
const int TILE_DIM_X = 32;  // Tile dimension for columns

__global__ void softmaxKernel2D(const float* input, float* output, int N, int M) {
    int row = blockIdx.y * TILE_DIM_Y + threadIdx.y;
    int col = blockIdx.x * TILE_DIM_X + threadIdx.x;

    __shared__ float tile[TILE_DIM_Y][TILE_DIM_X];
    __shared__ float exp_sums[TILE_DIM_Y];

    // Copy data from global memory to shared memory
    if (row < N && col < M) {
        tile[threadIdx.y][threadIdx.x] = input[row * M + col];
    }

    // Synchronize
    __syncthreads();

    // Calculate the sum of exponentials for each row
    float exp_val = expf(tile[threadIdx.y][threadIdx.x]);
    atomicAdd(&exp_sums[threadIdx.y], exp_val);

    // Synchronize
    __syncthreads();

    // Compute the softmax values
    if (row < N && col < M) {
        output[row * M + col] = exp_val / exp_sums[threadIdx.y];
    }
}

void randomInit(std::vector<float> &x) {
    // Pseudo-random float vector
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> unif(-1, 1);
    for (int i = 0; i < x.size(); i++) {
        x[i] = unif(gen);
    }
}

bool verifyOutput(const float* output, int N, int M) {
    for (int i = 0; i < N; i++) {
        float softmax_val = 0;
        for (int j = 0; j < M; j++) {
            softmax_val += output[i * M + j];
        }
        // Check if row sums to one
        if (fabs(softmax_val - 1) > 1e-6) {
            std::cout << softmax_val << std::endl;
            return false;
        }
    }
    return true;
}

int main() {
    int N = 2048;
    int M = 32;
    std::vector<float> h_input(N * M);
    std::vector<float> h_output(N * M);
    float *d_input, *d_output;

    // Randomly initialize input
    randomInit(h_input);

    // Allocate memory on the device
    cudaMalloc(&d_input, N * M * sizeof(float));
    cudaMalloc(&d_output, N * M * sizeof(float));

    // Copy to device
    cudaMemcpy(d_input,  h_input.data(),  N * M * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_output, h_output.data(), N * M * sizeof(float), cudaMemcpyHostToDevice);

    // Define block and grid dimensions
    dim3 threads(TILE_DIM_X, TILE_DIM_Y);
    dim3 blocks((M + TILE_DIM_X - 1) / TILE_DIM_X, (N + TILE_DIM_Y - 1) / TILE_DIM_Y);
    int shmem_size = TILE_DIM_X * TILE_DIM_Y * sizeof(float);

    // Launch the kernel
    softmaxKernel2D<<<blocks, threads, shmem_size>>>(d_input, d_output, N, M);

    // Copy to host
    cudaMemcpy(h_output.data(), d_output, N * M * sizeof(float), cudaMemcpyDeviceToHost);

    // Check
    if (verifyOutput(h_output.data(), N, M)) {
        std::cout << "Success!" << std::endl;
    } else {
        std::cout << "Failure!" << std::endl;
    }

    // Clean up
    cudaFree(d_input);
    cudaFree(d_output);

    return 0;
}
iHubble

1 Answer


One possible approach to handle larger arrays is to split your work into two kernels. The first will compute the global sums for each row. The second will perform the elementwise division. The break between the two kernels gives us the needed global synchronization between these two steps.

If we focus on the case where we are using 32x32 threadblocks, we can actually dispense with the shared memory usage altogether. We will use a warp-shuffle reduction for the row-wise accumulation of partial sums at the threadblock level. We will then atomically add the row-wise partial sums to global sum accumulators per row, to complete the first kernel processing.

The second kernel does an elementwise division to complete the processing.

Here is an example:

# cat t27.cu
#include <cuda_runtime.h>
#include <iostream>
#include <random>
#include <vector>

const int TILE_DIM_Y = 32;  // Tile dimension for rows
const int TILE_DIM_X = 32;  // Tile dimension for columns; must be 32 for this method

template <typename T>
__global__ void softmaxKernel2D_rows(const T* input, T* exp_sums, int N, int M) {
    int row = blockIdx.y * TILE_DIM_Y + threadIdx.y;
    int col = blockIdx.x * TILE_DIM_X + threadIdx.x;
    T val = 0;
    // Load an element from global memory and exponentiate it (no shared memory needed)
    if (row < N && col < M) {
        if (sizeof(T) == 8)
          val = exp(input[row * M + col]);
        else
          val = expf(input[row * M + col]);
    }
    // warp shuffle reduction
    // Use XOR mode to perform butterfly reduction
    for (int i=16; i>=1; i>>=1)
        val += __shfl_xor_sync(0xffffffff, val, i, 32);
    // update global value for row
    if ((threadIdx.x == 0) && (row < N)) atomicAdd(exp_sums+row, val);
}

template <typename T>
__global__ void softmaxKernel2D_elementwise(const T* input, const T* exp_sums, T *output, int N, int M) {
    int row = blockIdx.y * TILE_DIM_Y + threadIdx.y;
    int col = blockIdx.x * TILE_DIM_X + threadIdx.x;
    // Compute the softmax values
    if (row < N && col < M) {
      if (sizeof(T) == 8)
        output[row * M + col] = exp(input[row * M + col])/ exp_sums[row];
      else
        output[row * M + col] = expf(input[row * M + col])/ exp_sums[row];
    }
}

template <typename T>
void randomInit(std::vector<T> &x) {
    // Pseudo-random float vector
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_real_distribution<> unif(-1, 1);
    for (int i = 0; i < x.size(); i++) {
        x[i] = unif(gen);
    }
}
template <typename T>
bool verifyOutput(const T* output, int N, int M) {
    for (int i = 0; i < N; i++) {
        T softmax_val = 0;
        for (int j = 0; j < M; j++) {
            softmax_val += output[i * M + j];
        }
        // Check if row sums to one
        if (fabs(softmax_val - 1) > 1e-6) {
            std::cout << softmax_val << std::endl;
            return false;
        }
    }
    return true;
}

using mt = float;

int main() {
    int N = 2048; // number of rows in dataset
    int M = 2048; // number of columns in dataset
    std::vector<mt> h_input(N * M);
    std::vector<mt> h_output(N * M);
    mt *d_input, *d_output, *d_sums;

    // Randomly initialize input
    randomInit(h_input);

    // Allocate memory on the device
    cudaMalloc(&d_input, N * M * sizeof(mt));
    cudaMalloc(&d_output, N * M * sizeof(mt));
    cudaMalloc(&d_sums, N * sizeof(mt));
    cudaMemset(d_sums, 0, N*sizeof(mt));
    // Copy to device
    cudaMemcpy(d_input,  h_input.data(),  N * M * sizeof(mt), cudaMemcpyHostToDevice);

    // Define block and grid dimensions
    dim3 threads(TILE_DIM_X, TILE_DIM_Y);
    dim3 blocks((M + TILE_DIM_X - 1) / TILE_DIM_X, (N + TILE_DIM_Y - 1) / TILE_DIM_Y);

    // Launch the kernel
    softmaxKernel2D_rows<<<blocks, threads>>>(d_input, d_sums, N, M);
    softmaxKernel2D_elementwise<<<blocks, threads>>>(d_input, d_sums, d_output,  N, M);

    // Copy to host
    cudaMemcpy(h_output.data(), d_output, N * M * sizeof(mt), cudaMemcpyDeviceToHost);

    // Check
    if (verifyOutput(h_output.data(), N, M)) {
        std::cout << "Success!" << std::endl;
    } else {
        std::cout << "Failure!" << std::endl;
    }

    // Clean up
    cudaFree(d_input);
    cudaFree(d_output);
    cudaFree(d_sums);

    return 0;
}
# nvcc -o t27 t27.cu -arch=sm_89
# ./t27
Success!
#

Notes:

  • you are mixing methodologies for shared memory usage. With the kind of shared memory declarations you have:

    __shared__ float tile[TILE_DIM_Y][TILE_DIM_X];
    __shared__ float exp_sums[TILE_DIM_Y];
    

    we refer to those as static allocations. They don't require dynamically allocated space at the point of kernel launch:

    int shmem_size = TILE_DIM_X * TILE_DIM_Y * sizeof(float);
    
    // Launch the kernel
    softmaxKernel2D<<<blocks, threads, shmem_size>>>(d_input, d_output, N, M);
                                       ^^^^^^^^^^
    
  • the shared memory declaration:

    __shared__ float exp_sums[TILE_DIM_Y];
    

    does not initialize any values, so you are adding to an uninitialized value here (a minimal zero-initialization sketch is shown after these notes):

    atomicAdd(&exp_sums[threadIdx.y], exp_val);
    
  • for large enough datasets, your usage of float will eventually run out of resolution for your math operations to satisfy the check method you are using. The dataset size I have chosen (2048 elements per row) is on the edge of what may work. Failures reported by your check algorithm don't necessarily indicate a problem with the algorithm. Stated another way, a similar CPU method could also run into resolution limits using float. I have written my code in such a way that you can test this. You will be able to handle much larger data sets by switching using mt = float; to using mt = double;. Also, in all cases, be sure to compile for the architecture to match your GPU (e.g. -arch=sm_XY). If you are using a GPU that does not support double-precision atomicAdd, you will get an error when trying to compile this code for double.

  • why don't I store the computed exponential of the input in the first kernel (for example, back into the input data array) for use later by the second kernel? Why recompute the exponential in the second kernel? In my view, both of these kernels are probably memory-bound. In a memory-bound setting, the performance of the kernel is strongly correlated with the number of load and store operations used. If we wanted to save the exponential computed elementwise in the first kernel, it would add an additional store operation. In a memory-bound kernel, the cost of the exp/expf operation should be approximately "free". So we prefer to recompute the value in the second kernel, rather than burdening the first kernel with an additional store operation.

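For reference, here is a minimal sketch (illustrative only; the kernel name is made up) of the zero-initialization fix called out in the second note, applied to the original single-kernel version. It still only handles the TILE_DIM_X >= M case, since the summation never crosses tile boundaries; the two-kernel approach above is what fixes that.

#include <cuda_runtime.h>

const int TILE_DIM_Y = 32;
const int TILE_DIM_X = 32;

__global__ void softmaxKernel2D_zeroinit(const float* input, float* output, int N, int M) {
    int row = blockIdx.y * TILE_DIM_Y + threadIdx.y;
    int col = blockIdx.x * TILE_DIM_X + threadIdx.x;

    __shared__ float exp_sums[TILE_DIM_Y];

    // Zero the per-row accumulators before any thread adds to them
    if (threadIdx.x == 0) exp_sums[threadIdx.y] = 0.0f;
    __syncthreads();

    // Out-of-range threads contribute 0 to the row sum
    float exp_val = 0.0f;
    if (row < N && col < M)
        exp_val = expf(input[row * M + col]);
    atomicAdd(&exp_sums[threadIdx.y], exp_val);
    __syncthreads();

    if (row < N && col < M)
        output[row * M + col] = exp_val / exp_sums[threadIdx.y];
}

Because the __shared__ array is a static allocation, the launch needs no third parameter:

    softmaxKernel2D_zeroinit<<<blocks, threads>>>(d_input, d_output, N, M);

A dynamically allocated tile would instead be declared as extern __shared__ float tile[]; and sized via that third launch parameter.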
Robert Crovella
  • Thanks Robert for the great answer. For context, I'm trying to fuse many operations into one kernel to reduce the number of kernel launches for performance reasons. The softmax is just one of many operations I would like to perform within the kernel. Do you think that would be feasible? Operations will always be at most row-wise (i.e., along the x-axis). Maybe not worth the trouble? – iHubble Aug 23 '23 at 15:35
  • kernel fusion has [potential value](https://stackoverflow.com/questions/53305830/cuda-how-does-kernel-fusion-improve-performance-on-memory-bound-applications-on/53311373#53311373) especially for "simple" operations like this with load and store steps. It's difficult to estimate the value without looking at specific examples. Breaking this kernel into two doesn't help with kernel fusion efforts, but you could refactor this to use [cooperative groups grid sync](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization) to recombine these two kernels I have here. – Robert Crovella Aug 23 '23 at 16:00
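For illustration, a sketch of the cooperative-groups grid-sync refactor mentioned in the last comment might look like the following (untested; it assumes the device supports cooperative launch, the code is compiled with -rdc=true, and the grid is small enough to be co-resident on the GPU, which may require mapping multiple tiles to each block):

#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

const int TILE_DIM_Y = 32;
const int TILE_DIM_X = 32;  // must be 32 for the warp-shuffle reduction

__global__ void softmaxKernel2D_fused(const float* input, float* exp_sums,
                                      float* output, int N, int M) {
    cg::grid_group grid = cg::this_grid();
    int row = blockIdx.y * TILE_DIM_Y + threadIdx.y;
    int col = blockIdx.x * TILE_DIM_X + threadIdx.x;

    // Phase 1: per-row partial sums, as in the first kernel above
    float val = 0.0f;
    if (row < N && col < M) val = expf(input[row * M + col]);
    for (int i = 16; i >= 1; i >>= 1)
        val += __shfl_xor_sync(0xffffffff, val, i, 32);
    if ((threadIdx.x == 0) && (row < N)) atomicAdd(exp_sums + row, val);

    // Grid-wide barrier replaces the break between the two kernels
    grid.sync();

    // Phase 2: elementwise division
    if (row < N && col < M)
        output[row * M + col] = expf(input[row * M + col]) / exp_sums[row];
}

On the host side, the kernel must be launched with cudaLaunchCooperativeKernel rather than the <<<...>>> syntax, and d_sums must still be zeroed (e.g. with cudaMemset) beforehand:

    void* args[] = {&d_input, &d_sums, &d_output, &N, &M};
    cudaLaunchCooperativeKernel((void*)softmaxKernel2D_fused, blocks, threads, args, 0, 0);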