
In my code I have a dynamic matrix.

int ** file_data = (int **)malloc(TRANSACTIONS * sizeof(int *));
file_data[0] = (int *)malloc((a_size+1) * sizeof(int));
file_data[1] = (int *)malloc((a_size+1) * sizeof(int));
file_data[2] = (int *)malloc((a_size+1) * sizeof(int));
................................................................

I want to copy it to device global memory only once.

I have used:

__device__ int raw_data[][];
...................................
...................................
...................................
cudaMemcpyToSymbol(raw_data[i], file_data[i], (a_size+1)*sizeof(int));

But this does not work.

How can I do it?

Armeno
  • And what is wrong with `cudaMalloc` and `cudaMemcpy`? – pptaszni May 29 '19 at 18:14
  • If you have multiple `malloc` statements in your host code as you have shown here, a separate one for each row, there is **no safe way** to copy that data to the device using only a single `cudaMemcpy` or `cudaMemcpyToSymbol` operation. It cannot be done. If you want to use a single copy operation, you will need to make some changes. A large variety of techniques are covered [here](https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824). – Robert Crovella May 29 '19 at 18:59
  • @Ptaq666 it gives "CUDA error: invalid device symbol". – Armeno May 29 '19 at 19:08

1 Answer


You need to flatten the data

If you're only working with rectangular matrices in the first place, I'd recommend always storing them this way anyway. Either way, you'll need to get the data into this form before copying it to device memory.

template<typename T>
class Matrix {
    std::vector<T> _data;
    size_t rows, columns;
public:
    Matrix(size_t rows, size_t columns) :rows(rows), columns(columns) {
        _data.resize(rows * columns);
    }

    T & operator()(size_t row, size_t column) & {
        return _data.at(row * columns + column); //Row-Major Ordering
    }
    
    T const& operator()(size_t row, size_t column) const& {
        return _data.at(row * columns + column);
    }

    T operator()(size_t row, size_t column) && {
        return _data.at(row * columns + column);
    }

    T * data() & {
        return _data.data();
    }
    
    T const* data() const& {
        return _data.data();
    }

    std::pair<size_t, size_t> size() const {
        return {rows, columns};
    }

    size_t flat_size() const {
        return rows * columns;
    }

    size_t byte_size() const {
        return flat_size() * sizeof(T);
    }
};

int ** file_data = (int **)malloc(TRANSACTIONS * sizeof(int *));
file_data[0] = (int *)malloc((a_size+1) * sizeof(int));
file_data[1] = (int *)malloc((a_size+1) * sizeof(int));
file_data[2] = (int *)malloc((a_size+1) * sizeof(int));
//................................................................

Matrix<int> flat_data(TRANSACTIONS, a_size + 1);
for(size_t row = 0; row < TRANSACTIONS; row++) {
    for(size_t column = 0; column < a_size + 1; column++) {
        flat_data(row, column) = file_data[row][column];
    }
}
//ALTERNATIVE: use this instead of your manual mallocs in the first place!

cudaMemcpyToSymbol(/*buffer name*/, flat_data.data(), flat_data.byte_size());

This has a major advantage: instead of copying each row individually into its own buffer, you put all of them together in one contiguous allocation, saving memory and reducing the number of API calls you need to make. And a class designed specifically for this won't break when you inevitably make a mistake trying to manually handle all the pointer management in your original code.
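For completeness, a hedged sketch of what the device-side symbol could look like. `cudaMemcpyToSymbol` requires a symbol whose size is known at compile time, which is one reason the question's `__device__ int raw_data[][];` declaration fails with "invalid device symbol"; `MAX_ELEMS` below is a hypothetical upper bound you would have to choose yourself:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical compile-time upper bound on TRANSACTIONS * (a_size + 1);
// a __device__ symbol cannot be declared with an unspecified size.
constexpr size_t MAX_ELEMS = 1 << 20;
__device__ int raw_data[MAX_ELEMS];

void upload(const int *flat, size_t byte_size) {
    // One copy for the whole flattened matrix; note the symbol comes
    // first and the host source pointer second.
    cudaError_t err = cudaMemcpyToSymbol(raw_data, flat, byte_size);
    if (err != cudaSuccess)
        std::fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
}
```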

Xirema
  • @user10933809 Such a solution is almost never advisable. Most Heterogenous Computing environments have not-especially-high limits on the number of buffers that can be individually addressed for a single kernel. Unless the OP's dataset is expressly limited to a very small number (i.e. `TRANSACTIONS` is less than 100) that kind of solution simply is not viable. What I've done is provide a helpful framework to quickly convert their host-side "array of pointers" into a flat array that will fit the data layout that device data buffers expect. – Xirema May 29 '19 at 18:25
  • no doubt the array must be flattened here. I was just wondering if you could provide a solution with array of pointers as well. – Oblivion May 29 '19 at 18:29
  • @user10933809 As I said: that kind of solution is not viable. – Xirema May 29 '19 at 18:36
  • No, it is not a rectangular matrix @Xirema. – Armeno May 29 '19 at 18:41
  • @Armeno That's fine; the solution works as-is provided that the number you pass as the number of columns is equal to the largest row in your input data. – Xirema May 29 '19 at 18:50
  • Thanks @Xirema but I am looking for another method. – Armeno May 29 '19 at 18:51
  • @Armeno Well, if your constraint is "I want to only copy it to device memory once", I don't know what other solution exists... – Xirema May 29 '19 at 18:53
  • @Oblivion No, there isn't any. – Armeno Jun 02 '19 at 19:22