
I just started playing with Boost.Compute to see how much speed it can bring us. I wrote a simple program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>   // for std::accumulate
#include <cmath>     // for std::sqrt
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;
            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            // time transform + accumulate on the device
            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                       device_vector.end(),
                       device_vector.begin(),
                       compute::sqrt<float>(), queue);

            // note: the int literal 0 makes the sum accumulate as an int
            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }
    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                host_vector.end(),
                host_vector.begin(),
                [](float v){ return std::sqrt(v); });

    // same int accumulation as the device version, for a fair comparison
    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;

    return 0;
}

And here's the sample output on my machine (win7 64-bit):

====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms

My question is: why is the plain (non-opencl) version faster?

  • You may take a look at http://stackoverflow.com/questions/23901979/performance-boost-compute-v-s-opencl-c-wrapper – cqdjyy01234 Jun 18 '14 at 08:20
  • Without even reading the code, your samples are too small for a performance comparison... – AK_ Jun 18 '14 at 08:21
  • @user1535111, yes, I did read that before this post. – Jamboree Jun 18 '14 at 08:26
  • @Jamboree So don't you think the gap comes from the compilation of the kernel? – cqdjyy01234 Jun 18 '14 at 08:28
  • @AK_, I'm not sure what a fair number would be; too large a number crashes the driver. The plain CPU is still the fastest after I raised the count to, say, 1600000. – Jamboree Jun 18 '14 at 08:33
  • @user1535111, I removed compute::transform and plain CPU is still faster, what do you say? – Jamboree Jun 18 '14 at 08:36
  • @Jamboree Have you removed boost::accumulate, which needs a kernel as well? – cqdjyy01234 Jun 18 '14 at 08:38
  • @Jamboree It seems boost::compute will cache the compiled kernel, so you may run compute::transform and compute::accumulate once before timing. – cqdjyy01234 Jun 18 '14 at 08:39
  • @user1535111, I wasn't aware that compute::accumulate also needs a kernel. – Jamboree Jun 18 '14 at 08:43
  • @Jamboree By samples I didn't mean the vectors, I meant the entire operation. A single execution that lasts 64 ms is way too short to measure performance. I'm not familiar enough with boost::compute to know exactly what it does internally and how much time it should take, but from what I remember they compile and cache the execution code at runtime, so you probably need to run it a couple of thousand times to get performance data that makes sense. Also, I wouldn't use sqrt; I would take something like matrix multiplication, or maybe FFT. – AK_ Jun 18 '14 at 10:46

3 Answers


As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you're being limited by kernel compilation time and transfer time to the GPU).

To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first one as that will be far greater because it includes the time to compile and store the kernels).
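For example, a minimal timing loop along these lines (my sketch, not part of the original answer; it reuses device_vector and queue from the question) discards the warm-up run and averages the rest:

const int runs = 100;
boost::chrono::nanoseconds total(0);
for (int i = 0; i < runs; ++i)
{
    auto t0 = boost::chrono::high_resolution_clock::now();
    compute::transform(device_vector.begin(), device_vector.end(),
                       device_vector.begin(), compute::sqrt<float>(), queue);
    queue.finish(); // wait until the device has actually finished
    auto t1 = boost::chrono::high_resolution_clock::now();
    if (i > 0) // skip run 0, which includes kernel compilation
        total += boost::chrono::duration_cast<boost::chrono::nanoseconds>(t1 - t0);
    // (repeatedly sqrt-ing mutates the data; harmless for timing purposes)
}
std::cout << "avg: " << total.count() / (runs - 1) / 1e6 << " ms\n";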

Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm which performs both the transform and reduction with a single kernel. The code would look like this:

float ans = 0;
compute::transform_reduce(
    device_vector.begin(),
    device_vector.end(),
    &ans,
    compute::sqrt<float>(),
    compute::plus<float>(),
    queue
);
std::cout << "ans: " << ans << std::endl;

You can also compile code that uses Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which enables the offline kernel cache (this requires linking with boost_filesystem). The kernels you use will then be stored in your file system and compiled only the very first time you run your application (NVIDIA on Linux already does this by default).
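For example, a GCC invocation might look like this (the exact library names depend on your Boost and OpenCL installation):

g++ -DBOOST_COMPUTE_USE_OFFLINE_CACHE main.cpp -lboost_filesystem -lboost_system -lboost_chrono -lOpenCL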

  • `transform_reduce` does perform better in this case. I also tried the equivalent `accumulate` with a custom function, but it's not as good as `transform_reduce`, and the results are somewhat different. – Jamboree Jun 19 '14 at 02:16
  • That's expected. For floating-point addition (which, unlike integer addition, is not associative), `accumulate()` will use a slower, non-parallel code path. – Kyle Lutz Jun 19 '14 at 02:35

I can see one possible reason for the big difference. Compare the CPU and GPU data flows:

CPU              GPU

                 copy data to GPU

                 set up compute code

calculate sqrt   calculate sqrt

sum              sum

                 copy data from GPU

Given this, it appears that the Intel chip is just a bit rubbish at general compute, while the NVidia card is probably suffering from the extra data copying and the setup needed to get the GPU to do the calculation.

You should try the same program with a much more complex operation: sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.
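As an illustration (my sketch, not part of the original answer), Boost.Compute's BOOST_COMPUTE_FUNCTION macro lets you define a heavier per-element operation; the name heavy_op and its loop body are arbitrary, chosen only to give the GPU real work per element:

// arbitrary busy-work per element, written in OpenCL C
BOOST_COMPUTE_FUNCTION(float, heavy_op, (float x),
{
    float acc = x;
    for (int i = 0; i < 1000; i++)
        acc = sqrt(acc) + sin(acc) * cos(acc);
    return acc;
});

compute::transform(device_vector.begin(), device_vector.end(),
                   device_vector.begin(), heavy_op, queue);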

In your example, moving the lambda into the accumulate would be faster (one pass over memory vs. two passes).
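A sketch of that single-pass version, applied to the host code from the question (note the 0.0f initial value, which also keeps the sum in floating point):

// one pass: apply sqrt inside the reduction instead of running
// a separate std::transform over the whole vector first
float ans = std::accumulate(host_vector.begin(), host_vector.end(), 0.0f,
    [](float sum, float v) { return sum + std::sqrt(v); });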


You're getting bad results because you're measuring time incorrectly.

An OpenCL device has its own time counters, which are unrelated to host counters. Every OpenCL task has 4 states whose timers can be queried (from the Khronos web site):

  1. CL_PROFILING_COMMAND_QUEUED, when the command identified by event is enqueued in a command-queue by the host
  2. CL_PROFILING_COMMAND_SUBMIT, when the command identified by event that has been enqueued is submitted by the host to the device associated with the command-queue.
  3. CL_PROFILING_COMMAND_START, when the command identified by event starts execution on the device.
  4. CL_PROFILING_COMMAND_END, when the command identified by event has finished execution on the device.

Keep in mind that these timers are device-side. So, to measure kernel and command-queue performance, you can query these timers; in your case, the last two are the ones you need.

In your sample code, you're measuring host time, which includes data transfer time (as Skizz said) plus all the time spent on command-queue maintenance.

So, to learn the actual kernel performance, you need either to attach a cl_event to the kernel launch (no idea how to do it in boost::compute) and query that event for the performance counters, or to make your kernel really huge and complicated to hide all the overheads.
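With the raw OpenCL C API, that event-based query looks roughly like this (a sketch; it assumes a queue created with CL_QUEUE_PROFILING_ENABLE and a kernel and global_size that are already set up):

cl_event event;

// enqueue the kernel and keep the event handle for profiling
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, &event);
clWaitForEvents(1, &event);

// device-side timestamps, in nanoseconds
cl_ulong start = 0, end = 0;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);

std::cout << "kernel time: " << (end - start) * 1e-6 << " ms\n";
clReleaseEvent(event);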

  • I meant to measure the host time, because I want to know how OpenCL performs compared to the normal solution. I think the device-side performance counter is better for comparing different algorithms written in OpenCL. – Jamboree Jun 19 '14 at 02:41