
I'm just starting to use Julia's CUDArt package to manage GPU computing. I am wondering how to ensure that, if I go to pull data from the GPU (e.g. using to_host()), I don't do so before all of the necessary computations have been performed on it.

Through some experimentation, it seems that to_host(CudaArray) will block while the particular CudaArray is still being updated. So perhaps just using this is enough to ensure safety? But it seems a bit chancy.

Right now, I am using the launch() function to run my kernels, as shown in the package documentation.

The CUDArt documentation gives an example using Julia's @sync macro, which seems like it could be lovely. But for the purposes of @sync, my "work" is done and I'm ready to move on as soon as the kernel gets launched with launch(), not once it finishes. As far as I understand the operation of launch(), there isn't a way to change this behavior (e.g. to make it wait for the output of the function it "launches").
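
To illustrate what I mean (a rough sketch; MyFunc, GridDim, BlockDim, and the arguments are placeholder names for my setup):

@sync @async begin
    launch(MyFunc, GridDim, BlockDim, (arg1, arg2))
    # this task is "done" here, even though the GPU may still be working
end
res = to_host(arg2)  # nothing above guarantees the kernel has finished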

How can I accomplish such synchronization?

Michael Ohlrogge

2 Answers


Ok, so, there isn't a ton of documentation on the CUDArt package, but I looked at the source code and the way to do this looks straightforward. In particular, it appears that there is a device_synchronize() function that will block until all of the work on the currently active device has finished. Thus, the following seems to work:

using CUDArt
md = CuModule("/path/to/module.ptx", false)  # load the compiled PTX module
MyFunc = CuFunction(md, "MyFunc")            # get a handle to the kernel function
GridDim = 2*2496
BlockDim = 64
launch(MyFunc, GridDim, BlockDim, (arg1, arg2, ...))  # returns as soon as the kernel is queued
device_synchronize()  # blocks until all work on the active device has finished
res = to_host(arg2)   # now safe to copy the result back to the host
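
As a quick sanity check (a rough sketch, using the same placeholder arguments), timing the two calls separately shows where the waiting actually happens:

t_launch = @elapsed launch(MyFunc, GridDim, BlockDim, (arg1, arg2))  # returns almost immediately
t_sync = @elapsed device_synchronize()  # this is where we actually wait for the kernel
println("launch: $t_launch s, synchronize: $t_sync s")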

I'd love to hear from anyone with more expertise, though, if there is anything more to be aware of here.

Michael Ohlrogge
  • Just a word of caution: undocumented functions may be there for internal use (thus the lack of documentation). Obviously this is not always the case, but if you use these functions, you might be setting yourself up for trouble when upgrading to the next version. Package and library owners tend to feel more at ease about heavily modifying or outright removing undocumented features, even between minor releases. Proceed with caution and be sure to write regression tests. – JDB Jun 21 '16 at 21:50

I think the more canonical way is to make a stream for each device:

streams = [(device(dev); Stream()) for dev in devlist]  # one Stream per device; devlist is a list of device IDs, e.g. from devices(dev -> true)

and then, inside each @async block, after you queue the computations, you call wait(stream) to tell the task to wait for that stream to finish its computations. See the Streams example in the README; a rough sketch of the pattern is below.
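
For concreteness, here is a sketch modeled on the README's Streams example (the kernel name, arguments, and device filter are placeholders, and it assumes launch() accepts a stream keyword as in that example):

using CUDArt

devlist = devices(dev -> true)  # placeholder filter: take all available devices
streams = [(device(dev); Stream()) for dev in devlist]
results = Array(Any, length(devlist))
@sync for (i, dev) in enumerate(devlist)
    @async begin
        device(dev)  # make this device active within this task
        launch(MyFunc, GridDim, BlockDim, (arg1, arg2); stream=streams[i])
        wait(streams[i])  # yields this task until the stream's work finishes
        results[i] = to_host(arg2)  # now safe to copy back
    end
end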

Chris Rackauckas
  • Good point. I think `device_synchronize` can still be useful in a number of settings. For one, you can use it alongside other functions, like those from CUBLAS, CUSPARSE, etc., which don't take streams as arguments. Also, if you're just working with a single GPU, you might well not even need streams, so `device_synchronize` can lead to a bit simpler applications. – Michael Ohlrogge Jul 16 '16 at 23:11