
I am working on a machine with an Nvidia GPU and CUDA 8, and I have a C++ application that should compute the L1 distance between two vectors represented as std::vector<double>.

Currently, my code is not parallel at all and only uses the CPU:

double compute_l1_distance(const std::vector<double> &v1, const std::vector<double> &v2) {
    if (v1.size() != v2.size()) {
        return -1;
    }
    double result = 0;
    for (std::size_t i = 0; i < v1.size(); i++) {
        double val = v1[i] - v2[i];
        if (val < 0) {
            val = 0 - val;
        }
        result += val;
    }

    return result;
}

How can I improve the performance of this computation? How can I utilize the GPU? Are there recommended libraries that will do the job fast using the GPU or using any other optimization?

SomethingSomething
  • yes, search for the thrust library – aram Jan 31 '18 at 13:30
  • I would start with using std::valarray, as it's about linear algebra – RLT Jan 31 '18 at 13:34
  • Unless you have vectors with millions of entries, or unless you have millions of small vectors, the likelihood that GPU computing is in any way useful here is slim, to say the least – talonmies Jan 31 '18 at 13:52
  • @talonmies what you're saying is that the cost of copying the data to the GPU is not worth it? My vectors only have 4096 values – SomethingSomething Jan 31 '18 at 14:06
  • No. I'm saying that the GPU is a latency hiding architecture and you need a huge volume of parallel work to hide all that latency. 4096*2 double precision ops is about 7 or 8 orders of magnitude too small – talonmies Jan 31 '18 at 14:28
  • My first suggestion would be to consider AVX. It operates on 4 doubles in parallel. That means it only needs 1024 iterations, not 4096. Also, `abs(x)` doesn't need a branch in AVX, as it's [`max(x,-x)`](https://stackoverflow.com/a/5993459/15416) – MSalters Jan 31 '18 at 14:56

2 Answers


Using the Thrust library, it would look something like this:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cmath>

double compute_l1_distance(const std::vector<double> &v1,
                           const std::vector<double> &v2) {
    if (v1.size() != v2.size()) {
        return -1;
    }
    // copy the host data to the GPU
    thrust::device_vector<double> dv1 = v1;
    thrust::device_vector<double> dv2 = v2;
    auto first = thrust::make_zip_iterator(thrust::make_tuple(
                                               dv1.begin(), dv2.begin()));
    auto last  = thrust::make_zip_iterator(thrust::make_tuple(
                                               dv1.end(),   dv2.end()  ));
    // |a - b| per pair; device lambdas need nvcc's --expt-extended-lambda flag
    const auto l1 = [] __host__ __device__ (const thrust::tuple<double,double>& arg) {
        return fabs(thrust::get<0>(arg) - thrust::get<1>(arg));
    };
    const auto add = [] __host__ __device__ (double a, double b) { return a + b; };

    return thrust::transform_reduce(first, last, l1, 0.0, add);
}

I suggest you use CUDA and Thrust to utilize the GPU. As far as performance is concerned: yes, it will be faster.

Please look into this post; it has a very clear description.

And if you have to call compute_l1_distance multiple times, you can use pthreads to run the calls in parallel.

HariUserX