
Question

When moving many random (non-coalesced) values within device global memory, what is the most efficient way to do it?

Note: Many values as in > 500.

Context

I've been working on a Genetic Algorithm implementation for GPUs for some time now, and I've been struggling to balance the flexibility of my framework against micro-optimizing for the GPU architecture. The GA data resides on the GPU at all times; only the best solution of each generation is copied to host memory.

Detailed scenario

I'm optimizing the migration function. Here only a little data is shuffled within device global memory. However, I've ordered my data so that it is coalesced for the GA operator kernels' memory access scheme, which makes shuffling a pair of "genomes" a matter of striding over single float values and swapping them with another genome in the same strided fashion.

Known solutions

The problem isn't memory bandwidth, but a matter of call latency and thread blocks stalling the process.

  1. I wrote a couple of device kernels whose only job is to move values between addresses. This launches a kernel with VERY low occupancy, divergent code and random memory access (therefore the little code it runs would be serialized), but it does the work with only two kernel calls:

    • 1st Kernel Copies values to buffer array.
    • 2nd Kernel Swaps values.
  2. I'm aware I could use cudaMemcpy for every value, but that would require many calls to cudaMemcpy, which I believe are synchronous calls.

Simplified code example:

int needed_genome_idx = 0; // Some random index.
for(int nth_gene = 0; nth_gene < num_genes; ++nth_gene)
{
  cudaMemcpy(genomes_buffer + nth_gene,
             src + needed_genome_idx + nth_gene * stride_size, // stride_size being a big number, usually equal to the size of the GA population.
             sizeof(float),
             cudaMemcpyDeviceToDevice);
}
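
For comparison, I believe the same strided gather could be expressed as a single cudaMemcpy2D call, since cudaMemcpy2D only needs consistent pitch values rather than memory allocated with cudaMallocPitch. A sketch, reusing the same hypothetical names (genomes_buffer, src, needed_genome_idx, num_genes, stride_size) as above:

```cuda
// Gather num_genes floats, spaced stride_size floats apart,
// into the contiguous genomes_buffer with one API call.
cudaMemcpy2D(genomes_buffer,              // dst
             sizeof(float),               // dst pitch: contiguous floats
             src + needed_genome_idx,     // src: first gene of the genome
             stride_size * sizeof(float), // src pitch: distance between genes
             sizeof(float),               // width: one float per "row"
             num_genes,                   // height: number of genes
             cudaMemcpyDeviceToDevice);
```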

Is this a viable solution? Would using cudaMemcpyAsync help performance?

Is there a better, or at least more elegant, way to do such memory operations?

  • What is the length of your genome vector? How many pairs of genomes do you need to shuffle? When you are shuffling one pair of the genome vectors, how many elements in the vector should be swapped? How these elements are chosen? (randomly or a range?) How is your genomes stored in the global mem? Could you give a more specific code sample? – kangshiyin Aug 24 '13 at 08:32
  • copying only 1 float number with 1 `cudaMemcpy()` call is obviously a VERY inefficient way – kangshiyin Aug 24 '13 at 08:36
  • I'm implementing the framework, and I'm aiming for flexibility. The nature of the GPU implementation calls for a GA setup using plenty of isles, with large populations and high dimensionality genomes, so it makes sense using the GPU instead of the CPU. So these details you are asking for, would vary. But typically I'm using the following: The genome vector would be in the 10-50 range, the genomes to shuffle would be most likely one per isle, ~16 (but could be a larger number), and the whole vector would be swapped in each migration operation (usually after a certain number of generations). – RdelCueto Aug 24 '13 at 08:46
  • I agree, using cudaMemcpy() seemed like a terrible idea. I thought maybe by using the asynchronous version, the driver's magic would bulk all the calls into a single (or at least fewer) transfers. I also thought of using cudaMemcpy2D(), which is used for strided access to matrix data. But I was cautious not to, since the _genomes memory_ isn't allocated as pitched device memory (using cudaMallocPitch). – RdelCueto Aug 24 '13 at 08:58
  • 1
    it will be no big differences if you still copy one float number at a time. generally `cudaMemcpy()`/`cmdaMemcpyAsync()` will reach its max speed when the size of the data > 100 KBbytes as shown in the D2D copy speed here http://stackoverflow.com/questions/17729351/cuda-how-much-slower-is-transferring-over-pci-e/17745946#17745946 – kangshiyin Aug 24 '13 at 09:29
  • 1
    `cudaMemcpyDeviceToDevice` can be slower than writing a kernel to do the same operation, see [CUDA Device To Device transfer expensive](http://stackoverflow.com/questions/6063619/cuda-device-to-device-transfer-expensive). Your code isn't very clear to me: `genomes_buffer` and `src` are undefined and `needed_genome_idx` is claimed to be random, but set to `0`. Could you be a bit more clear on what actually "migration" means for you? – Vitality Aug 25 '13 at 08:14
  • The code sample was intended only to illustrate how I was planning on using cudaMemcpy(), inside a loop, to get the strided single values. What the migration operator does is: Select random genomes from the population, and swap them among isles. This implies first copying them to a temporal buffer, and then replacing them with some migration scheme. Hence I have to move singled strided values back and forth. I wondered if maybe by using cudaMemcpy2D, this could be done efficiently, but that would still require one call per genome. I think I'm sticking to the _copying kernel_ solution. – RdelCueto Aug 25 '13 at 09:18

1 Answer


You can try writing a kernel to do the shuffle; it may be more efficient than calling cudaMemcpy so many times.
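
A minimal sketch of such a kernel, assuming the strided layout described in the question (gene `g` of genome `i` at `population[g * stride + i]`); all names here are hypothetical. Since a pairwise swap can be done in place, one kernel launch replaces both the buffer-copy and swap kernels:

```cuda
// Swap two genomes (columns a and b of the strided layout) in place.
// One thread per gene; launch with e.g.
//   swap_genomes<<<(num_genes + 255) / 256, 256>>>(pop, a, b, num_genes, stride);
__global__ void swap_genomes(float *population, int a, int b,
                             int num_genes, int stride)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g < num_genes) {
        float tmp = population[g * stride + a];
        population[g * stride + a] = population[g * stride + b];
        population[g * stride + b] = tmp;
    }
}
```

With ~16 genomes of 10-50 genes each per migration, the launch is tiny, but it still avoids hundreds of synchronous cudaMemcpy calls; several such swaps could also be batched into one launch (one block per genome pair) to amortize the launch latency further.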

Kill Console