
CUDA is awesome and I'm using it like crazy, but I'm not using its full potential because I'm having an issue transferring memory, and I was wondering if there was a better way to get a variable amount of data out. Basically, I send a 65,535-item array into CUDA, and CUDA analyzes each data item around 20,000 different ways; if there's a match in my program's logic, it saves a 30-int list as a result. Think of my logic as analyzing each different combination and then looking at the total; if the total is equal to a number I'm looking for, it saves the results (which is a 30-int list for each analyzed item).

The problem is 65,535 (blocks/items in the data array) * 20,000 (total combinations tested per item) = 1,310,700,000. This means I need to create an array of that size to handle the chance that all the data will be a positive match (which is extremely unlikely, and creating int output[1310700000][30] seems crazy memory-wise). I've been forced to make it smaller and send fewer blocks to process, because I don't know whether CUDA can write efficiently to a linked list or a dynamically sized list (with this approach, it writes the output to host memory indexed by block * number_of_different_way_tests).
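To make this concrete, here's a minimal sketch of the dense indexing I just described (the names and the match condition are made up for illustration; my real logic is more involved):

```cuda
#define NUM_ITEMS      65535
#define TESTS_PER_ITEM 20000
#define RESULT_LEN     30

__global__ void analyze_dense(const int *data, int *output, char *matched)
{
    int item = blockIdx.x;  // one block per data item
    for (int test = threadIdx.x; test < TESTS_PER_ITEM; test += blockDim.x) {
        // Every (item, test) pair owns a fixed slot, so the output array
        // must be sized for the worst case even though most slots stay empty.
        long long slot = (long long)item * TESTS_PER_ITEM + test;
        if ((data[item] + test) % 1000 == 0) {  // stand-in for the real match logic
            matched[slot] = 1;
            for (int i = 0; i < RESULT_LEN; ++i)
                output[slot * RESULT_LEN + i] = data[item] + i;  // dummy 30-int result
        }
    }
}
```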

Is there a better way to do this? Can CUDA somehow write to free memory that is not derived from the block ID? When I test this process on the CPU, less than 10% of the item array has a positive match, so it's extremely unlikely I'll use that much memory each time I send work to the kernel.

P.S. Looking at the above, although it's exactly what I'm doing, if it's confusing, then another way of thinking about it (not exactly what I'm doing, but good enough to understand the problem) is: I am sending 20,000 arrays (each containing 65,535 items), adding each item to its peer in the other arrays, and if the total equals a number I'm looking for (say 200-210), then I want to know the numbers that were added to get that matching result.

If the numbers span a very wide range, then not all will match, but with my approach I'm forced to malloc that huge amount of memory. Can I capture the results while mallocing less memory? My current approach is to malloc as much as I have free, but then I'm forced to run fewer blocks, which isn't efficient (I want to run as many blocks and threads at a time as possible, because I like the way CUDA organizes and runs the blocks). Are there any CUDA or C tricks I can use for this, or am I stuck with mallocing the maximum possible results (and buying a lot more memory)?

  • How about a few lines of code? – wallyk Jun 23 '12 at 02:17
  • If the positive matches are few, the alternative procedure that I describe in [this answer](http://stackoverflow.com/q/11148860/442006) may be a good fit. You would increase the index by 30 each time. – Roger Dahl Jun 23 '12 at 02:24
  • Hey @wallyk, is there anything unclear that I can explain more? Or if you feel code is needed, I can work on creating a separate sample (as my actual codebase is quite large and dependent on other files, I felt it would be more confusing than helpful). – Lostsoul Jun 23 '12 at 02:25
  • Thanks @RogerDahl, you saved the day again! I'll check out your answer and let you know how it turns out (I saw it before but didn't realize it applied to my problem). – Lostsoul Jun 23 '12 at 02:26
  • @harrism I think Roger's answer will work. I've been running around a lot and haven't had a chance to implement it yet. I will do it over the weekend and report back (if Roger doesn't write an answer, I'll write out the steps I took as an answer). – Lostsoul Jun 27 '12 at 15:20
  • @Lostsoul I also think Roger Dahl's answer should work. I'm curious why no (accepted) answer yet after a month. If that answer worked, it would be nice for this question to not show up in unanswered. In case it's useful to someone (OP or otherwise), I also recently answered a [similar question](http://stackoverflow.com/questions/11534862/i-want-change-my-code-from-cpp-to-cuda-any-idea/11536423) where I use thrust::remove_if(), and I provided some sample code as well. Cheers. – Nicu Stiurca Jul 22 '12 at 20:01
  • @SchighSchagh Sorry about that, this question slipped my mind. I added the answer and accepted it. Thanks. – Lostsoul Jul 22 '12 at 20:15

1 Answer


As per Roger Dahl's great answer: the functionality you're looking for is called stream compaction.

You probably do need to provide an array that contains room for 4 solutions per thread, because attempting to store the results directly in compact form is likely to create so many dependencies between the threads that the performance gained by copying less data back to the host is lost to a longer kernel execution time.

The exception is if almost all of the threads find no solutions. In that case, you might be able to use an atomic operation to maintain an index into an array: for each solution that is found, you store it in the array at that index and then atomically increase the index. I think it would be safe to use atomicAdd() for this. Before storing a result, the thread would use atomicAdd() to increase the index by one. atomicAdd() returns the old value, and the thread can store the result using the old value as the index.
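A minimal sketch of that atomic-index scheme (the kernel name, the match condition, and the result values here are illustrative, not from the question's code):

```cuda
#define RESULT_LEN 30

__global__ void analyze_compact(const int *data, int n_items,
                                int *results,        // capacity * RESULT_LEN ints
                                unsigned int *count) // running number of results
{
    int item = blockIdx.x * blockDim.x + threadIdx.x;
    if (item >= n_items) return;

    if (data[item] % 10 == 0) {  // stand-in for the real match logic
        // atomicAdd returns the old value, which becomes this thread's slot.
        unsigned int slot = atomicAdd(count, 1u);
        for (int i = 0; i < RESULT_LEN; ++i)
            results[slot * RESULT_LEN + i] = data[item] + i;  // dummy result
    }
}
```

After the kernel finishes, copy count back to the host and transfer only count * RESULT_LEN ints, rather than the worst-case array.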

However, in the more common situation, where there is a fair number of results, the best solution is to perform the compacting operation as a separate step. One way to do this is with thrust::copy_if. See this question for some more background.
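For illustration, here is a sketch of such a separate compaction pass with thrust::copy_if, assuming the kernel has already filled a flags array marking which slots hold valid results (the helper name and predicate are made up):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate: keep a slot when its flag is nonzero.
struct is_match
{
    __host__ __device__ bool operator()(int flag) const { return flag != 0; }
};

// Returns the indices of the matching slots, compacted to the front.
thrust::device_vector<int> compact_indices(const thrust::device_vector<int> &flags)
{
    thrust::device_vector<int> indices(flags.size());
    // Enumerate slot indices 0..N-1; copy_if keeps only those whose
    // corresponding flag passes the predicate (flags acts as a stencil).
    thrust::device_vector<int>::iterator end =
        thrust::copy_if(thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>((int)flags.size()),
                        flags.begin(),
                        indices.begin(),
                        is_match());
    indices.resize(end - indices.begin());
    return indices;
}
```

The surviving indices can then drive a gather of the corresponding 30-int records, so only the actual matches are copied back to the host.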
