Reducing GPU-CPU data transfers in C++Amp

Question

I have run into the following problem while trying to optimize my application with C++ AMP: data transfers. Copying data from the CPU to the GPU is not a problem, as I can do it once when the application initializes. What hurts is that I need fast access to the results computed by the C++ AMP kernels, so the GPU-to-CPU transfer is the bottleneck. I have read that there is a performance boost under Windows 8.1, but I am using Windows 7 and am not planning to change that. I have also read about staging arrays, but I don't see how they would solve my problem. I need to return a single float value to the host, and that seems to be the most time-consuming operation.

float Subset::reduction_cascade(unsigned element_count, concurrency::array<float, 1>& a) 
{
static_assert(_tile_count > 0, "Tile count must be positive!");
//static_assert(IS_POWER_OF_2(_tile_size), "Tile size must be a positive integer power of two!");

assert(source.size() <= UINT_MAX);
//unsigned element_count = static_cast<unsigned>(source.size());
assert(element_count != 0); // Cannot reduce an empty sequence.

unsigned stride = _tile_size * _tile_count * 2;

// Tail elements: tail_length is computed below but the tail is never reduced here, so tail_sum stays 0.
float tail_sum = 0.f;
unsigned tail_length = element_count % stride;
// Using arrays as a temporary memory.
//concurrency::array<float, 1> a(element_count, source.begin());
concurrency::array<float, 1> a_partial_result(_tile_count);

concurrency::parallel_for_each(concurrency::extent<1>(_tile_count * _tile_size).tile<_tile_size>(), [=, &a, &a_partial_result] (concurrency::tiled_index<_tile_size> tidx) restrict(amp)
{
    // Use tile_static as a scratchpad memory.
    tile_static float tile_data[_tile_size];

    unsigned local_idx = tidx.local[0];

    // Reduce data strides of twice the tile size into tile_static memory.
    unsigned input_idx = (tidx.tile[0] * 2 * _tile_size) + local_idx;
    tile_data[local_idx] = 0;
    do
    {
        tile_data[local_idx] += a[input_idx] + a[input_idx + _tile_size]; 
        input_idx += stride;
    } while (input_idx < element_count);

    tidx.barrier.wait();

    // Reduce to the tile result using multiple threads.
    for (unsigned stride = _tile_size / 2; stride > 0; stride /= 2)
    {
        if (local_idx < stride)
        {
            tile_data[local_idx] += tile_data[local_idx + stride];
        }

        tidx.barrier.wait();
    }

    // Store the tile result in the global memory.
    if (local_idx == 0)
    {
        a_partial_result[tidx.tile[0]] = tile_data[0];
    }
});

// Reduce results from all tiles on the CPU.
std::vector<float> v_partial_result(_tile_count);
copy(a_partial_result, v_partial_result.begin());
return std::accumulate(v_partial_result.begin(), v_partial_result.end(), tail_sum);  
}

I checked that in the example above the most time-consuming operation is copy(a_partial_result, v_partial_result.begin());. I am trying to find a better approach.
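
For checking kernel output, a CPU-only model of the same cascade can be handy. The sketch below is plain standard C++ rather than C++ AMP; reduction_cascade_cpu and its tile_size and tile_count parameters are hypothetical stand-ins for the _tile_size and _tile_count constants in the listing. Unlike the listing, which declares tail_sum and tail_length but never sums the tail, this version reduces the leftover element_count % stride elements on the host first, so the strided loop only ever reads whole strides.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical CPU model of the cascading reduction (not the C++ AMP kernel).
// tile_size and tile_count play the role of _tile_size and _tile_count.
float reduction_cascade_cpu(const std::vector<float>& src,
                            unsigned tile_size, unsigned tile_count)
{
    unsigned element_count = static_cast<unsigned>(src.size());
    assert(element_count != 0); // Cannot reduce an empty sequence.
    const unsigned stride = tile_size * tile_count * 2;

    // Sum the tail on the host so the main loop sees a multiple of stride.
    const unsigned tail_length = element_count % stride;
    const float tail_sum = std::accumulate(src.end() - tail_length, src.end(), 0.f);
    element_count -= tail_length;
    if (element_count == 0) return tail_sum;

    // Each (tile, local) pair mimics one GPU thread: it adds pairs of
    // elements tile_size apart, then jumps ahead by a full stride.
    std::vector<float> partial(tile_count, 0.f);
    for (unsigned tile = 0; tile < tile_count; ++tile)
        for (unsigned local = 0; local < tile_size; ++local)
            for (unsigned i = tile * 2 * tile_size + local; i < element_count; i += stride)
                partial[tile] += src[i] + src[i + tile_size];

    // Final reduction of the per-tile results, seeded with the tail sum.
    return std::accumulate(partial.begin(), partial.end(), tail_sum);
}
```

With tile_size = 4 and tile_count = 2 the stride is 16, so a 1000-element input leaves a tail of 8 elements that is summed separately from the strided loop.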

How are you timing the data copies vs. the compute parts of your code? Remember that, to some extent, C++ AMP calls are asynchronous: they queue things to the DMA buffer and only block when needed. See the following answer for more discussion on timing: http://stackoverflow.com/questions/13936994/copy-data-from-gpu-to-cpu/14013053#14013053 — Ade Miller, Feb 19 '14 at 23:44
I am timing it in the same way that I am timing the non-parallel methods. When I commented out the copy() call, the time dropped from 800-900 ms to 300 ms. — Paweł Jastrzębski, Feb 19 '14 at 23:54
If you are not forcing the AMP kernel to finish its compute by either copying the data or calling synchronize() or wait() then you may not be timing anything at all. See the link in my previous comment. — Ade Miller, Feb 20 '14 at 00:25
So after calling wait() explicitly I got: ~640 ms without copy() and ~1300 ms with copy(). What's even worse, my previous methods seem to be slower than I expected after adding wait() everywhere. That's really bad news. — Paweł Jastrzębski, Feb 20 '14 at 01:06
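
The pitfall discussed in these comments can be illustrated without a GPU. The following is a simplified model in standard C++, not C++ AMP: DeferredQueue is a hypothetical stand-in for an accelerator's command queue whose submit() only records work and whose wait() actually runs it. A real accelerator runs queued work concurrently rather than lazily, but the timing consequence is similar: a timer stopped before the wait mostly measures how long it took to queue the work.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of a command queue: submit() records, wait() executes.
class DeferredQueue {
public:
    void submit(std::function<void()> work)
    {
        assert(work && "cannot submit an empty task");
        pending_.push_back(std::move(work));
    }
    void wait()
    {
        for (auto& work : pending_) work();
        pending_.clear();
    }
private:
    std::vector<std::function<void()>> pending_;
};

struct Timings {
    double submit_only_ms; // timer stopped right after queuing the work
    double with_wait_ms;   // timer stopped after the work has really finished
};

Timings time_deferred_work(std::chrono::milliseconds cost)
{
    using clock = std::chrono::steady_clock;
    using ms = std::chrono::duration<double, std::milli>;

    DeferredQueue queue;
    const auto start = clock::now();
    queue.submit([cost] { std::this_thread::sleep_for(cost); }); // returns at once
    const auto submitted = clock::now();
    queue.wait();                                                // pays the real cost
    const auto done = clock::now();

    return { ms(submitted - start).count(), ms(done - start).count() };
}
```

In C++ AMP terms, the counterpart of queue.wait() is forcing completion before stopping the timer, for example with the wait() or synchronize() calls mentioned above.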

score 1 · Accepted Answer · answered Feb 21 '14 at 06:07

So I think there's something else going on here. Have you tried running the original sample on which your code is based? This is available on CodePlex.

Load the samples solution and build the Reduction project in Release mode and then run it without the debugger attached. You should see some output like this.

Running kernels with 16777216 elements, 65536 KB of data ...
Tile size:     512
Tile count:    128
Using device : NVIDIA GeForce GTX 570

                                                           Total : Calc

SUCCESS: Overhead                                           0.03 : 0.00 (ms)
SUCCESS: CPU sequential                                     9.48 : 9.45 (ms)
SUCCESS: CPU parallel                                       5.92 : 5.89 (ms)
SUCCESS: C++ AMP simple model                              25.34 : 3.19 (ms)
SUCCESS: C++ AMP simple model using array_view             62.09 : 20.61 (ms)
SUCCESS: C++ AMP simple model optimized                    25.24 : 1.81 (ms)
SUCCESS: C++ AMP tiled model                               29.70 : 7.27 (ms)
SUCCESS: C++ AMP tiled model & shared memory               30.40 : 7.56 (ms)
SUCCESS: C++ AMP tiled model & minimized divergence        25.21 : 5.77 (ms)
SUCCESS: C++ AMP tiled model & no bank conflicts           25.52 : 3.92 (ms)
SUCCESS: C++ AMP tiled model & reduced stalled threads     21.25 : 2.03 (ms)
SUCCESS: C++ AMP tiled model & unrolling                   22.94 : 1.55 (ms)
SUCCESS: C++ AMP cascading reduction                       20.17 : 0.92 (ms)
SUCCESS: C++ AMP cascading reduction & unrolling           24.01 : 1.20 (ms)

Note that none of the examples takes anywhere near as long as your code does. That said, it's fair to say that the CPU is faster here, and data copy time is a big contributing factor.

This is to be expected. Effective use of a GPU involves moving more than operations like reduction to the GPU. You need to move a significant amount of compute to make up for the copy overhead.
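
A rough way to read the table above: treat the gap between the Total and Calc columns as fixed overhead (copies, launch and synchronization). That is an interpretation of the sample's output rather than something it states. The helper names below are hypothetical, and the sketch is plain C++ arithmetic, not C++ AMP.

```cpp
#include <cassert>

// Overhead a GPU run pays beyond the kernel itself, assumed here to be
// the difference between the sample's Total and Calc columns.
double transfer_overhead_ms(double total_ms, double calc_ms)
{
    assert(total_ms >= calc_ms);
    return total_ms - calc_ms;
}

// True when the GPU path, overhead included, beats the CPU path.
bool offload_pays_off(double cpu_ms, double gpu_calc_ms, double overhead_ms)
{
    return gpu_calc_ms + overhead_ms < cpu_ms;
}
```

Plugging in the cascading reduction row (20.17 ms total, 0.92 ms calc) gives about 19.25 ms of overhead, which on its own exceeds the 9.48 ms CPU sequential total, so under this reading the offload cannot pay off for a lone reduction however fast the kernel is.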

Some things you should consider:

What happens when you run the sample from CodePlex?
Are you running a release build with optimization enabled?
Are you sure you are running against the actual GPU hardware and not against a WARP (software emulator) accelerator?

Some more information that would be helpful:

What hardware are you using?
How large is your data set, both the input data and the size of the partial result array?

Did this help or are you still experiencing really slow copies? — Ade Miller, Feb 25 '14 at 15:29
Yes, it helped me a lot. It turned out that the tests I was running were measuring in microseconds (µs), not milliseconds. That was the issue. I want to optimize two methods (a convolution calculation and another very simple mathematical equation). On the CPU, this equation is very fast (around 50 microseconds, i.e. 0.05 ms). Copying one float from concurrency::array<...> to the CPU takes much longer than 0.05 ms; I think it is at least 0.9 ms, so the copy alone makes the GPU-accelerated computation more than 10 times slower. Or am I wrong here? — Paweł Jastrzębski, Feb 26 '14 at 00:10
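
The figures in this comment support the estimate. Taking them at face value (about 900 microseconds per readback, about 50 microseconds per CPU evaluation) and assuming the readback cost is roughly fixed per transfer for tiny payloads, the sketch below works out the slowdown and how many evaluations would have to share one readback before the GPU path could win. The function names and the fixed-cost assumption are illustrative, not measured.

```cpp
#include <cassert>

// How many times longer one readback takes than one CPU evaluation
// (integer ratio, times in microseconds).
unsigned slowdown_factor(unsigned copy_us, unsigned cpu_per_item_us)
{
    assert(cpu_per_item_us > 0);
    return copy_us / cpu_per_item_us;
}

// Smallest batch size n for which n GPU evaluations plus ONE readback beat
// n CPU evaluations, i.e. n * gpu + copy < n * cpu.
// Returns 0 when the GPU can never win at these per-item costs.
unsigned min_batch_to_break_even(unsigned copy_us,
                                 unsigned cpu_per_item_us,
                                 unsigned gpu_per_item_us)
{
    if (gpu_per_item_us >= cpu_per_item_us) return 0;
    return copy_us / (cpu_per_item_us - gpu_per_item_us) + 1;
}
```

With 900 and 50 microseconds the readback alone costs 18 times the CPU evaluation, and even with a kernel that took no time at all, at least 19 evaluations would have to be batched per readback to come out ahead.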
