
In my code I have a dynamic matrix.

int ** file_data = (int **)malloc(TRANSACTIONS * sizeof(int *));
file_data[0] = (int *)malloc((a_size+1) * sizeof(int));
file_data[1] = (int *)malloc((a_size+1) * sizeof(int));
file_data[2] = (int *)malloc((a_size+1) * sizeof(int));
................................................................

I want to copy it to device global memory only once.

I have used:

__device__ int raw_data[][];
...................................
...................................
...................................
cudaMemcpyToSymbol(raw_data[i], file_data[i], (a_size+1)*sizeof(int));

But this does not work.

How can I do it?

Armeno
  • And what is wrong with `cudaMalloc` and `cudaMemcpy`? – pptaszni May 29 '19 at 18:14
  • If you have multiple `malloc` statements in your host code as you have shown here, a separate one for each row, there is **no safe way** to copy that data to the device using only a single `cudaMemcpy` or `cudaMemcpyToSymbol` operation. It cannot be done. If you want to use a single copy operation, you will need to make some changes. A large variety of techniques are covered [here](https://stackoverflow.com/questions/45643682/cuda-using-2d-and-3d-arrays/45644824#45644824). – Robert Crovella May 29 '19 at 18:59
  • @Ptaq666 it gives "CUDA error: invalid device symbol". – Armeno May 29 '19 at 19:08

1 Answer


You need to flatten the data

If you're only working with rectangular matrices in the first place, I'd recommend always storing them this way anyway. Either way, you'll need to get the data into this form before copying it to device memory.

template<typename T>
class Matrix {
    std::vector<T> _data;
    size_t rows, columns;
public:
    Matrix(size_t rows, size_t columns) :rows(rows), columns(columns) {
        _data.resize(rows * columns);
    }

    T & operator()(size_t row, size_t column) & {
        return _data.at(row * columns + column); //Row-Major Ordering
    }
    
    T const& operator()(size_t row, size_t column) const& {
        return _data.at(row * columns + column);
    }

    T operator()(size_t row, size_t column) && {
        return _data.at(row * columns + column);
    }

    T * data() & {
        return _data.data();
    }
    
    T const* data() const& {
        return _data.data();
    }

    std::pair<size_t, size_t> size() const {
        return {rows, columns};
    }

    size_t flat_size() const {
        return rows * columns;
    }

    size_t byte_size() const {
        return flat_size() * sizeof(T);
    }
};

int ** file_data = (int **)malloc(TRANSACTIONS * sizeof(int *));
file_data[0] = (int *)malloc((a_size+1) * sizeof(int));
file_data[1] = (int *)malloc((a_size+1) * sizeof(int));
file_data[2] = (int *)malloc((a_size+1) * sizeof(int));
//................................................................

Matrix<int> flat_data(TRANSACTIONS, a_size + 1);
for(size_t row = 0; row < TRANSACTIONS; row++) {
    for(size_t column = 0; column < a_size + 1; column++) {
        flat_data(row, column) = file_data[row][column];
    }
}
//ALTERNATIVE: use this instead of your manual mallocs in the first place!

cudaMemcpyToSymbol(/*buffer name*/, flat_data.data(), flat_data.byte_size());

This has a major advantage: instead of copying each row individually into its own buffer, you put all of them together in one contiguous allocation, saving memory and reducing the number of API calls you need to make. And a class designed specifically for this won't break when you inevitably make a mistake trying to manually handle all the pointer management in your original code.
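For completeness, a hedged sketch of what the device-side symbol could look like. `cudaMemcpyToSymbol` requires a symbol whose size is known at compile time, which is one reason the question's `__device__ int raw_data[][];` declaration fails with "invalid device symbol"; `MAX_ELEMS` below is a hypothetical upper bound you would have to choose yourself:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical compile-time upper bound on TRANSACTIONS * (a_size + 1);
// a __device__ symbol cannot be declared with an unspecified size.
constexpr size_t MAX_ELEMS = 1 << 20;
__device__ int raw_data[MAX_ELEMS];

void upload(const int *flat, size_t byte_size) {
    // One copy for the whole flattened matrix; note the symbol comes
    // first and the host source pointer second.
    cudaError_t err = cudaMemcpyToSymbol(raw_data, flat, byte_size);
    if (err != cudaSuccess)
        std::fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(err));
}
```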

Xirema
  • @user10933809 Such a solution is almost never advisable. Most Heterogenous Computing environments have not-especially-high limits on the number of buffers that can be individually addressed for a single kernel. Unless the OP's dataset is expressly limited to a very small number (i.e. `TRANSACTIONS` is less than 100) that kind of solution simply is not viable. What I've done is provide a helpful framework to quickly convert their host-side "array of pointers" into a flat array that will fit the data layout that device data buffers expect. – Xirema May 29 '19 at 18:25
  • no doubt the array must be flattened here. I was just wondering if you could provide a solution with array of pointers as well. – Oblivion May 29 '19 at 18:29
  • @user10933809 As I said: that kind of solution is not viable. – Xirema May 29 '19 at 18:36
  • No, it is not a rectangular matrix @Xirema. – Armeno May 29 '19 at 18:41
  • @Armeno That's fine; the solution works as-is provided that the number you pass as the number of columns is equal to the largest row in your input data. – Xirema May 29 '19 at 18:50
  • Thanks @Xirema but I am looking for another method. – Armeno May 29 '19 at 18:51
  • @Armeno Well, if your constraint is "I want to only copy it to device memory once", I don't know what other solution exists... – Xirema May 29 '19 at 18:53
  • @Oblivion No, there isn't any. – Armeno Jun 02 '19 at 19:22