
I'm wondering if it is possible to write a persistent GPU function. I have my doubts, but I'm not sure how the scheduler works.

I'm looking to process an unknown number of data points (approx. 50 million). The data arrives in chunks of 20 or so. It would be good if I could drop these 20 points into a GPU 'bucket' and have this 'persistent' operation grab and process them as they come in. When done, grab the result.
I could keep the GPU busy with dummy data while the bucket is empty, but I think race conditions on a partially empty bucket will be an issue.
I suspect I wouldn't be able to run any other operations on the GPU while this persistent operation is running, i.e. put other undedicated SMs to work.

Is this a viable (fermi) GPU approach, or just a bad idea?

Doug
  • It's a bit ambiguous what you want to do; when you say "process", is this processing independent of the newly arriving data? Also, what are your timing requirements? Is something preventing you from collecting all the points and processing them at once? As currently described, I'd say this wouldn't constitute a good use of the GPU; running on 20 elements at a time will generally be better on the CPU (although then again it depends on what you're trying to do) – alrikai Feb 06 '13 at 22:14
  • All 50 million data points are independent of each other. They all undergo the same process, and they all contribute to the single result. Sending the full 50 million points in one chunk yields no speedup; transferring in 20-point chunks gives the least CPU overhead. – Doug Feb 06 '13 at 22:51
  • So you can perform this operation independently, but you'll get a scalar result? Is there anything preventing you from doing multiple kernel invocations as you accumulate data and storing your running partial result in global memory? What I was trying to convey with the bit about running on 20 elements is that it's more efficient to accumulate something larger (e.g. a couple thousand points) and send them to the GPU at once. But without knowing the specifics of your program I can't make specific recommendations. – alrikai Feb 07 '13 at 00:51
  • If I understand correctly, you want to process the data as a stream? I don't know if a persistent kernel is possible, but you can copy data into global memory while the GPU is working by using a different stream. That way, when the GPU finishes an operation, it can start the next one directly without waiting for the data to be copied. It would be preferable to process a bigger set of data than 20 elements. – Seltymar Feb 08 '13 at 01:06
  • I can also mention that there may be a time limit on kernel execution. Old GPUs had no way to check whether a program was working fine or something bad had happened, so the driver terminates kernel execution after some timeout (several seconds). You should check whether this feature is turned on in your system. – Oleg Titov Feb 08 '13 at 09:22
  • Since you have to actively copy the data to the GPU, you know when new data has arrived. So, after the copy, you just launch a kernel on the new data. You can allocate the entire buffer up front and copy the data into it as it arrives. In the kernel parameters, you can include a pointer to the new data, and its size. Launching a kernel is a fast operation. – Roger Dahl Feb 08 '13 at 14:35
  • @OlegTitov This happens on Windows because of the timeout recovery (it can be deactivated). But does it still happen if you use 2 graphics cards? One to compute data and one for the display. – Seltymar Feb 12 '13 at 02:16
  • You should look at the documents linked in this question: http://stackoverflow.com/questions/14821029/persistent-thread-in-opencl-or-cuda – Seltymar Feb 14 '13 at 04:32
  • Persistent kernels are possible. Take a look [here](http://stackoverflow.com/questions/33150040/doubling-buffering-in-cuda-so-the-cpu-can-operate-on-data-produced-by-a-persiste/33158954#33158954). – Robert Crovella Nov 01 '15 at 15:05

1 Answer


I'm not sure whether such a persistent kernel is possible, but it would certainly be very inefficient. Although the idea is elegant, it doesn't fit the GPU: you would have to communicate globally which thread picks which element out of the bucket, some threads might never even start because they'd be waiting for others to finish, and the bucket would have to be declared volatile, slowing down every access to your input data.

A more common solution to your problem is to divide the data into chunks and asynchronously copy the chunks to the GPU. You would use two streams: one running a kernel on the last chunk sent, the other copying a new chunk from the host. The copy and the kernel then actually run simultaneously, so you are likely to hide most of the transfer time. But don't let the chunks become too small, or your kernel will suffer from low occupancy and performance will degrade.
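A minimal sketch of that two-stream pipeline, assuming the CUDA runtime API. The chunk size, `num_chunks`, and the host-side `fill_from_host_source` producer are placeholders for however your data actually arrives, and the `atomicAdd` kernel is just a stand-in for whatever reduction produces your single result (Fermi supports `atomicAdd` on `float`):

```cuda
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   // accumulate ~1M points per transfer, not 20

// Hypothetical host-side producer: blocks until n new points have arrived.
void fill_from_host_source(float *dst, int n);

// Stand-in for the real per-point processing + reduction.
__global__ void process_chunk(const float *in, int n, float *partial)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(partial, in[i]);
}

int main(void)
{
    const int num_chunks = 50;   // placeholder: ~50M points / CHUNK
    float *h_buf[2], *d_buf[2], *d_partial;
    cudaStream_t stream[2];

    for (int s = 0; s < 2; ++s) {
        cudaMallocHost(&h_buf[s], CHUNK * sizeof(float)); // pinned, required for async copy
        cudaMalloc(&d_buf[s], CHUNK * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    cudaMalloc(&d_partial, sizeof(float));
    cudaMemset(d_partial, 0, sizeof(float));

    for (int chunk = 0; chunk < num_chunks; ++chunk) {
        int s = chunk & 1;                      // ping-pong between the two buffers
        cudaStreamSynchronize(stream[s]);       // wait until this buffer is free again
        fill_from_host_source(h_buf[s], CHUNK); // meanwhile the other stream computes
        cudaMemcpyAsync(d_buf[s], h_buf[s], CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process_chunk<<<(CHUNK + 255) / 256, 256, 0, stream[s]>>>(
            d_buf[s], CHUNK, d_partial);
    }

    cudaDeviceSynchronize();
    float result;
    cudaMemcpy(&result, d_partial, sizeof(float), cudaMemcpyDeviceToHost);

    for (int s = 0; s < 2; ++s) {
        cudaFreeHost(h_buf[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFree(d_partial);
    return 0;
}
```

While stream 0's kernel runs, stream 1's host-to-device copy proceeds in parallel (and vice versa), which is exactly the overlap described above. The running result stays in global memory between launches, so no persistent kernel is needed.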