Cuda is awesome and I'm using it like crazy but I am not using her full potential because I'm having a issue transferring memory and was wondering if there was a better way to get a variable amount of memory out. Basically I send 65535 item array into Cuda and Cuda analyzes each data item around 20,000 different ways and if there's a match in my programs logic then it saves a 30 int list as a result. Think of my logic of analyzing each different combination and then looking at the total and if the total is equal to a number I'm looking for then it saves the results (which is a 30 int list for each analyzed item).
The problem is 65535 (blocks/items in data array) * 20000 (total combinations tested per item) = 1,310,700,000. This means I need to create a array of that size to deal with the chance that all the data will be a positive match (which is extremely unlikely and creating int output[1310700000][30]
seems crazy for memory). I've been forced to make it smaller and send less blocks to process because I don't know how if Cuda can write efficiently to a linked list or a dynamically sized list (with this approach the it writes the output to host memory using block * number_of_different_way_tests).
Is there a better way to do this? Can Cuda somehow write to free memory that is not derived from the blockid? When I test this process on the CPU, less then 10% of the item array have a positive match so its extremely unlikely I'll use so much memory each time I send work to the kernel.
p.s. I'm looking above and although it's exactly what I'm doing, if it's confusing then another way of thinking about it (not exactly what I'm doing but good enough to understand the problem) is I am sending 20,000 arrays (that each contain 65,535 items) and adding each item with its peer in the other arrays and if the total equals a number (say 200-210) then I want to know the numbers it added to get that matching result.
If the numbers are very widely range then not all will match but using my approach I'm forced to malloc that huge amount of memory. Can I capture the results with mallocing less memory? My current approach to is malloc as much as I have free but I'm forced to run less blocks which isn't efficient (I want to run as many blocks and threads a time because I like the way Cuda organizes and runs the blocks). Is there any Cuda or C tricks I can use for this or I'm a stuck with mallocing the max possible results (and buying a lot more memory)?