
Is there a simple way to define and access a 2D matrix on a CUDA GPU?

Something like M[i][j].

Maybe there are already some libraries for this?

mrgloom
  • Take a look at the second part of @talonmies' answer to this question: [How can I add up two 2d (pitched) arrays using nested for loops?](http://stackoverflow.com/questions/6137218/how-can-i-add-up-two-2d-pitched-arrays-using-nested-for-loops). – Vitality Feb 12 '14 at 07:55
  • @JackOLantern I want to operate on some abstraction like thrust's vector, but for matrices. – mrgloom Feb 12 '14 at 10:17
  • 1
    #define ][ *width+ ===> M[i][j] becomes M[i*width+j] on background maybe? – huseyin tugrul buyukisik Feb 12 '14 at 14:15
  • Then you need a higher-level language layer. You may wish to try [Accelereyes' ArrayFire](http://www.accelereyes.com/products/arrayfire) or some libraries based on the expression templates technique, like [Newton](https://github.com/jaredhoberock/newton), which uses thrust behind the scenes. – Vitality Feb 12 '14 at 21:05
  • If you know the *width* of your (2D) array at compile time, you can [leverage the compiler to help you](http://stackoverflow.com/questions/12924155/sending-3d-array-to-cuda-kernel/12925014#12925014) and make it "easier" to pass multidimensional arrays between device and host (see the sketch after these comments). But for the general fully-dynamic case, the deep-copy process is more complicated. – Robert Crovella Feb 14 '14 at 18:17
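For illustration, a minimal sketch of the compile-time-width trick mentioned in the last comment (the `WIDTH` value, kernel name, and sizes here are placeholder assumptions):

```cuda
#include <cuda_runtime.h>

#define WIDTH 64  // column count fixed at compile time (placeholder value)

// Because the width is known statically, the kernel parameter can be a real
// 2D array type and the compiler generates the row-stride arithmetic itself.
__global__ void scale(float M[][WIDTH], int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < height && j < WIDTH)
        M[i][j] *= 2.0f;  // genuine M[i][j] syntax on the device
}

int main()
{
    int height = 64;
    float (*d_M)[WIDTH];  // pointer to rows of WIDTH floats
    cudaMalloc((void**)&d_M, height * WIDTH * sizeof(float));  // one flat block

    dim3 block(16, 16), grid((WIDTH + 15) / 16, (height + 15) / 16);
    scale<<<grid, block>>>(d_M, height);
    cudaDeviceSynchronize();
    cudaFree(d_M);
    return 0;
}
```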

1 Answer


Usually in CUDA you will have to convert your arrays to linear memory; for 2D arrays, that means allocating pitched linear memory with cudaMallocPitch.
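For illustration, a minimal sketch of that pitched approach (the kernel, sizes, and data type are placeholder assumptions):

```cuda
#include <cuda_runtime.h>

// Each row of a pitched allocation starts at a pitch-byte boundary, so row i
// is found by advancing i * pitch BYTES from the base pointer.
__global__ void scale(float* M, size_t pitch, int width, int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < height && j < width) {
        float* row = (float*)((char*)M + i * pitch);
        row[j] *= 2.0f;  // the element that would be M[i][j]
    }
}

int main()
{
    int width = 64, height = 64;
    float* d_M;
    size_t pitch;
    // The runtime pads each row so that rows stay properly aligned.
    cudaMallocPitch((void**)&d_M, &pitch, width * sizeof(float), height);

    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    scale<<<grid, block>>>(d_M, pitch, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_M);
    return 0;
}
```

Host/device copies of pitched memory go through cudaMemcpy2D, which takes the source and destination pitches into account.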
If you insist on using the M[i][j] notation, you can allocate the array on the device as an "array of arrays": allocate each row with cudaMalloc, then store the pointers to the rows in an array of pointers. That array of pointers must then itself be allocated on the device!

Then, when you write M[i], you get the pointer to the i-th row, and [j] indexes into that row.
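For illustration, a minimal sketch of that array-of-arrays layout (names and sizes are placeholder assumptions):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float** M, int width, int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < height && j < width)
        M[i][j] *= 2.0f;  // double indirection: fetch row pointer, then element
}

int main()
{
    int width = 64, height = 64;

    // Allocate each row separately; keep the row pointers on the host for now.
    float** h_rows = new float*[height];
    for (int i = 0; i < height; ++i)
        cudaMalloc((void**)&h_rows[i], width * sizeof(float));

    // The array of row pointers must itself live on the device.
    float** d_M;
    cudaMalloc((void**)&d_M, height * sizeof(float*));
    cudaMemcpy(d_M, h_rows, height * sizeof(float*), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    scale<<<grid, block>>>(d_M, width, height);
    cudaDeviceSynchronize();

    for (int i = 0; i < height; ++i)
        cudaFree(h_rows[i]);
    cudaFree(d_M);
    delete[] h_rows;
    return 0;
}
```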

From the guy who has been looking into this stuff heavily for the past 3 weeks (a.k.a. me!), take my word for it: it's the worst thing you could do. The allocations end up scattered all over global memory, and most probably none of them meets CUDA's alignment requirements, so the accesses are not fully coalesced and the access latency will kill your kernel's performance. Stick to linear and pitched memory for the best performance. The indexing convention may be a little confusing and awkward at first, but you will get used to it :-)
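For comparison, a minimal sketch of the recommended flat layout, with the i*width+j convention spelled out (the IDX macro is just an illustrative helper, not a library facility):

```cuda
#include <cuda_runtime.h>

// Illustrative helper: element (i, j) of a row-major matrix with `width` columns.
#define IDX(i, j, width) ((i) * (width) + (j))

__global__ void scale(float* M, int width, int height)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (i < height && j < width)
        M[IDX(i, j, width)] *= 2.0f;  // the conceptual M[i][j]
}

int main()
{
    int width = 64, height = 64;
    float* d_M;
    cudaMalloc((void**)&d_M, width * height * sizeof(float));  // one contiguous block

    dim3 block(16, 16), grid((width + 15) / 16, (height + 15) / 16);
    scale<<<grid, block>>>(d_M, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_M);
    return 0;
}
```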

Maghoumi