The kernel shown in the question could be used without modification to sum a vector to each of the rows of a matrix (assuming C-style row-major storage), subject to certain limitations. A demonstration is here.
The main limitation of that approach is that the maximum vector length (and therefore matrix width) that can be handled is equal to the maximum number of threads per block, which on current CUDA 7-supported GPUs is 1024.
We can eliminate that limitation with a slight modification to the vector indexing, and by passing the row width (number of columns) as a parameter to the kernel. With this modification, we should be able to handle arbitrary matrix (and vector) sizes.
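For example, a minimal sketch of that modified kernel for just the row-major case might look like the following (the kernel name vec_mat_row_add_rm and the choice of float data are mine, for illustration; the full templated version appears in the complete example below):

// Illustrative sketch (assumes float data, row-major storage, and a grid
// sized to cover at least height*width threads). One thread handles one
// matrix element; idx % width selects the vector element for that column.
__global__ void vec_mat_row_add_rm(const unsigned int height, const unsigned int width,
                                   float *matrix, const float *vector)
{
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < height * width)
    matrix[idx] += vector[idx % width];
}

Because each thread maps to one matrix element rather than the threads of a single block mapping to one row, the matrix width is no longer tied to the threads-per-block limit.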
EDIT: Based on discussion/comments, the OP wants to know how to handle row-major or column-major underlying storage. The following example uses a templated kernel to select either row-major or column-major underlying storage, and also shows one possible CUBLAS method for performing an add-vector-to-each-matrix-row operation using the rank-1 update function <T>ger.
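To see why a rank-1 update works for this: cublas<T>ger computes A = alpha*x*y^T + A. If we choose x to be a length-ROWS vector of ones, y to be the length-COLS vector we want to add, and alpha = 1, then x*y^T is a matrix in which every row is a copy of y^T, so the update adds the vector to every row of A: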
$ cat t712.cu
#include <iostream>
#include <cublas_v2.h>
#define ROWS 20
#define COLS 10
#define nTPB 64
#define ROW_MAJOR 0
#define COL_MAJOR 1
template <int select, typename T>
__global__ void vec_mat_row_add(const unsigned int height, const unsigned int width, T* matrix, const T* vector)
{
  // get the current element index for the thread
  unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < height*width)
  {
    // sum the current matrix element with the appropriate vector element
    if (select == ROW_MAJOR)
      matrix[idx] += vector[idx%width];
    else // COL_MAJOR
      matrix[idx] += vector[idx/height];
  }
}
int main(){

  float *h_mat, *d_mat, *h_vec, *d_vec;
  const unsigned int msz = ROWS*COLS*sizeof(float);
  const unsigned int vsz = COLS*sizeof(float);
  h_mat = (float *)malloc(msz);
  h_vec = (float *)malloc(vsz);
  cudaMalloc(&d_mat, msz);
  cudaMalloc(&d_vec, vsz);
  for (int i=0; i<COLS; i++) h_vec[i] = i; // set vector to 0,1,2, ...
  cudaMemcpy(d_vec, h_vec, vsz, cudaMemcpyHostToDevice);

  // test row-major case
  cudaMemset(d_mat, 0, msz); // set matrix to zero
  vec_mat_row_add<ROW_MAJOR><<<(ROWS*COLS + nTPB - 1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec);
  cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
  std::cout << "Row-major result: " << std::endl;
  for (int i = 0; i < ROWS; i++){
    for (int j = 0; j < COLS; j++) std::cout << h_mat[i*COLS+j] << " ";
    std::cout << std::endl;}

  // test column-major case
  cudaMemset(d_mat, 0, msz); // set matrix to zero
  vec_mat_row_add<COL_MAJOR><<<(ROWS*COLS + nTPB - 1)/nTPB, nTPB>>>(ROWS, COLS, d_mat, d_vec);
  cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
  std::cout << "Column-major result: " << std::endl;
  for (int i = 0; i < ROWS; i++){
    for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " ";
    std::cout << std::endl;}

  // test CUBLAS, doing matrix-vector add using <T>ger
  cudaMemset(d_mat, 0, msz); // set matrix to zero
  float *d_ones, *h_ones;
  h_ones = (float *)malloc(ROWS*sizeof(float));
  for (int i = 0; i < ROWS; i++) h_ones[i] = 1.0f;
  cudaMalloc(&d_ones, ROWS*sizeof(float));
  cudaMemcpy(d_ones, h_ones, ROWS*sizeof(float), cudaMemcpyHostToDevice);
  cublasHandle_t ch;
  cublasCreate(&ch);
  float alpha = 1.0f;
  cublasStatus_t stat = cublasSger(ch, ROWS, COLS, &alpha, d_ones, 1, d_vec, 1, d_mat, ROWS);
  if (stat != CUBLAS_STATUS_SUCCESS) {std::cout << "CUBLAS error: " << (int)stat << std::endl; return 1;}
  cudaMemcpy(h_mat, d_mat, msz, cudaMemcpyDeviceToHost);
  std::cout << "CUBLAS Column-major result: " << std::endl;
  for (int i = 0; i < ROWS; i++){
    for (int j = 0; j < COLS; j++) std::cout << h_mat[j*ROWS+i] << " ";
    std::cout << std::endl;}
  return 0;
}
$ nvcc -o t712 t712.cu -lcublas
$ ./t712
Row-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
Column-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
CUBLAS Column-major result:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
$
For brevity of presentation, I've not included proper CUDA error checking, but that is always a good idea any time you are having trouble with a CUDA code. As a proxy/shortcut, you can run your code with cuda-memcheck as a quick check to see if there are any CUDA errors.
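If you do want explicit checking in code, a minimal sketch of the kind of macro I have in mind is shown below (the name cudaCheckErrors is my own choice here, and it assumes <cstdio> and <cstdlib> are included):

// minimal error-checking sketch: report and exit on the most recent CUDA error
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
          msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)

You would invoke it after CUDA API calls and kernel launches, e.g. cudaCheckErrors("kernel launch failed"); note that to catch asynchronous errors from a kernel you'd want a cudaDeviceSynchronize() before the check.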
Note that we expect all 3 printouts to be identical, because each one displays the same logical matrix; the correct way to display the matrix does not depend on whether the underlying storage is row-major or column-major. The difference in underlying storage is accounted for in the for-loops handling the display output: element (i,j) is read from h_mat[i*COLS+j] in the row-major case, but from h_mat[j*ROWS+i] in the column-major case.