
I'm using OpenCV for a computer vision application. I'd like to accelerate some matrix operations (the matrices are fairly large) on the GPU and want to avoid coding directly in CUDA C, if possible. OpenCV 2.4.1 has a number of GPU-accelerated functions. How well do they perform in your experience? Am I better off using another library (e.g. Thrust) instead?

EDIT Sample application: Calculate squared Euclidean distance matrix on GPU. Currently, my GPU accelerated (and vectorized) implementation in Matlab using the Parallel Computing Toolbox (PCT) is about 5-10 times faster than my C++ implementation with OpenCV.

Matlab implementation:

function K = sqEuclideanDist(P_cpu,Q_cpu)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))

P_gpu = gpuArray(P_cpu);
Q_gpu = gpuArray(Q_cpu);

[nP, d] = size(P_gpu);
[nQ, d] = size(Q_gpu);

pmag = sum(P_gpu .* P_gpu, 2);
qmag = sum(Q_gpu .* Q_gpu, 2);

% note that K is on GPU
K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P_gpu*Q_gpu';

end

UPDATE Here's another Matlab implementation that accomplishes the same thing (thanks to https://stackoverflow.com/a/7774323/1121420). However, it runs only on the CPU because bsxfun is not supported by PCT. I'm still looking for a C++ alternative, though.

function K = sqEuclideanDist(P_cpu,Q_cpu)
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))
% Runs on CPU only.

K = bsxfun(@plus,sum(P_cpu.^2,2),sum(Q_cpu.^2,2)') - 2*(P_cpu*Q_cpu');

end
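
For a C++ baseline that needs no GPU library, the same expansion K(i,j) = |P(i,:)|^2 + |Q(j,:)|^2 - 2*P(i,:)*Q(j,:)' can be sketched in plain standard C++. This is only a CPU reference under assumptions that are not in the original post: row-major std::vector<double> storage and the hypothetical function name sqEuclideanDistCpu. It is meant as something to check GPU output against, not as a fast implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not from the original post).
// P is nP x d and Q is nQ x d, both stored row-major.
// Returns K (nP x nQ, row-major) with
//   K[i*nQ + j] = |P_i|^2 + |Q_j|^2 - 2 * dot(P_i, Q_j).
std::vector<double> sqEuclideanDistCpu(const std::vector<double>& P, std::size_t nP,
                                       const std::vector<double>& Q, std::size_t nQ,
                                       std::size_t d)
{
    // Squared row norms, the counterparts of pmag and qmag in the Matlab code.
    std::vector<double> pmag(nP, 0.0), qmag(nQ, 0.0);
    for (std::size_t i = 0; i < nP; ++i)
        for (std::size_t k = 0; k < d; ++k)
            pmag[i] += P[i * d + k] * P[i * d + k];
    for (std::size_t j = 0; j < nQ; ++j)
        for (std::size_t k = 0; k < d; ++k)
            qmag[j] += Q[j * d + k] * Q[j * d + k];

    // Combine the norms with the cross term -2 * P * Q'.
    std::vector<double> K(nP * nQ);
    for (std::size_t i = 0; i < nP; ++i) {
        for (std::size_t j = 0; j < nQ; ++j) {
            double dot = 0.0;
            for (std::size_t k = 0; k < d; ++k)
                dot += P[i * d + k] * Q[j * d + k];
            K[i * nQ + j] = pmag[i] + qmag[j] - 2.0 * dot;
        }
    }
    return K;
}
```

Calling sqEuclideanDistCpu(P, nP, Q, nQ, d) on the same data that is sent to the GPU gives a result to compare against element by element, up to floating-point rounding.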
Community
Alexey

2 Answers


I find ArrayFire to be much faster and have started using it instead of the GPU kernels in OpenCV for image processing. Here are some benchmarks I found comparing ArrayFire (which used to have a different interface called LibJacket) to OpenCV, and my own benchmarking agrees: ArrayFire is 2-4X faster than the GPU functions in OpenCV. From what I hear, NVIDIA didn't write the GPU kernels in OpenCV but contracted them out to someone, which may be why they are so slow. Since I'm only using 1 GPU, I can use ArrayFire for free.

Update, given the new MATLAB code posted by @Alex: I ran the benchmark of this code on my system. I get that the Parallel Computing Toolbox gpuArray is slower than the CPU, but Jacket and ArrayFire kick butt. HW specs are:

Intel(R) Xeon(R) CPU X5660  @ 2.80GHz
NVIDIA Tesla M2090

Results of CPU vs GPU using Parallel Computing Toolbox gpuArray (fully warmed up). CPU is faster than PCT gpuArray:

>> tic; sqEuclideanDist(gpuArray(rand(1581,3)),gpuArray(rand(189,3))); toc;
Elapsed time is 0.006859 seconds.
>> tic; sqEuclideanDist(rand(1581,3),rand(189,3)); toc;
Elapsed time is 0.005712 seconds.

Results of CPU vs GPU using Jacket (fully warmed up). Jacket beats PCT gpuArray by 3.7X and beats the CPU by 3X:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001876 seconds.

Here is the modified code that lets you run all that easily:

function K = sqEuclideanDist(P,Q)
% Vectorized method to compute pairwise squared Euclidean distance on GPU
% Returns K(i,j) = (P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:))

[nP, d] = size(P);
[nQ, d] = size(Q);

pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);

K = ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q';

end

Jacket does support BSXFUN on the GPU, and it does improve the speeds somewhat:

>> tic; sqEuclideanDist(gdouble(rand(1581,3)),gdouble(rand(189,3))); toc;
Elapsed time is 0.001420 seconds.

Note that the sizes used here are pretty small, so most CUDA code that attempts to run on these small sizes is likely to perform poorly. That's why I like to use AccelerEyes' stuff, because those guys have optimized the heck out of the GPU, unlike PCT gpuArray, Thrust, and OpenCV, each of which I've tried in the past.
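
Timings this short are easy to distort, so for a C++ comparison it may help to use a harness that does a few untimed warm-up runs and then averages over many iterations, much like the ArrayFire benchmark loop in this answer. Here is a minimal sketch using std::chrono. The name averageSeconds and its parameters are hypothetical, and for GPU code the callable would also need to wait for the device to finish (for example with a sync call) before the clock is read.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not part of any library mentioned here).
// Runs `work` `warmup` times without timing, then returns the mean
// wall-clock time per call, in seconds, over `iterations` timed calls.
double averageSeconds(const std::function<void()>& work, int warmup, int iterations)
{
    // Untimed warm-up so one-time setup costs do not pollute the result.
    for (int i = 0; i < warmup; ++i)
        work();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Something like averageSeconds([&] { /* compute K, then sync */ }, 10, 1000) would then report a per-call average that is less sensitive to the first-run overhead.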

Here are the ArrayFire Free C++ results:

Time:  0.0003577 seconds
Speedups:  19.2X faster than PCT gpuArray, 16X faster than the CPU, 5.2X faster
than Jacket in MATLAB original version, 4X faster than Jacket in MATLAB using
BSXFUN

Here is the ArrayFire code I wrote for this:

#include <arrayfire.h>
#include <cstdio>

using namespace af;

static array SqEuclideanDist(array P, array Q)
{
    // 0 based indexing
    array pmag = sum(P * P, 1);
    array qmag = sum(Q * Q, 1);

    int np = P.dims(0);
    int nq = Q.dims(0);

    array K = tile(qmag.T(), np, 1) + tile(pmag, 1, nq) - 2 * matmul(P, Q.T());
    return K;
}

int main(int argc, char **argv)
{
    // Value-initialize (zero) the host buffers so no uninitialized memory is read
    double *P_cpu = new double[1581 * 3]();
    double *Q_cpu = new double[189 * 3]();

    array P = array(1581, 3, P_cpu);
    array Q = array(189 , 3, Q_cpu);
    af::sync();

    int iter = 1000;

    timer::tic();
    for (int i = 0; i < iter; i++) {
        array K = SqEuclideanDist(P, Q);
        af::eval(K);
    }

    af::sync();
    printf("Time taken: %2.4lfms\n", (1000 * timer::toc()) / iter);

    delete[] P_cpu;
    delete[] Q_cpu;
}
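
When porting this code, a quick check is that the distance matrix from a set of points to itself has zeros on the diagonal, which is how the bug discussed in the comments on this answer was noticed. The sketch below is plain C++ with hypothetical names (sqDistDirect, maxAbsDiagonal) and assumes the result has been copied back to the host as a contiguous buffer. It computes distances directly from differences, so it does not share the rounding behaviour of the expanded formula.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical direct reference (not from the original answer).
// P is nP x d and Q is nQ x d, both row-major.
// K[i*nQ + j] = sum over k of (P[i][k] - Q[j][k])^2.
std::vector<double> sqDistDirect(const std::vector<double>& P, std::size_t nP,
                                 const std::vector<double>& Q, std::size_t nQ,
                                 std::size_t d)
{
    std::vector<double> K(nP * nQ, 0.0);
    for (std::size_t i = 0; i < nP; ++i)
        for (std::size_t j = 0; j < nQ; ++j)
            for (std::size_t k = 0; k < d; ++k) {
                const double diff = P[i * d + k] - Q[j * d + k];
                K[i * nQ + j] += diff * diff;
            }
    return K;
}

// Largest absolute value on the diagonal of an n x n matrix stored contiguously.
// Element (i,i) sits at index i*n + i in both row-major and column-major layouts.
double maxAbsDiagonal(const std::vector<double>& K, std::size_t n)
{
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        worst = std::max(worst, std::fabs(K[i * n + i]));
    return worst;
}
```

For a GPU result computed with the expanded formula, the diagonal will generally be close to zero rather than exactly zero, so comparing maxAbsDiagonal against a small tolerance is probably more useful than testing for equality.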
mkln
Ben Stewart
  • Great job. Thanks for providing the alternatives. Definitely learned something today: didn't know about Jacket's support for bsxfun and I like the simple code of ArrayFire. The only thing is that, even though there is a free version of the ArrayFire C++ library, the free version offers pretty limited functionality (for example, it does not support linear algebra operations). I'm looking for an open source library, can you suggest any? – Alexey Jun 29 '12 at 19:12
  • You're welcome. Surprising how many people downvoted this post. Probably MathWorks employees. – Ben Stewart Jun 30 '12 at 00:47
  • There is unfortunately not an open source library that gives very good performance. That's why I've been using ArrayFire, because at least it is free for what I need. Pretty much every function in ArrayFire is free, except for those that come from CULA, which is better than MAGMA for linear algebra stuff. But ArrayFire does have free single-precision linear algebra functions, which I use quite frequently. Would that work for you? BTW, the code you posted doesn't use those linear algebra features. – Ben Stewart Jun 30 '12 at 00:49
  • Yeah, I'm not sure about the negative votes, wish people explained their reasoning. I tried Matlab with Jacket for my application and it does offer better performance (about 3x speedup) over PCT. Going to see if I can squeeze some more performance using the free version of C++ ArrayFire. – Alexey Jul 03 '12 at 20:43
  • I'm not entirely sure what's going on, but I tried a direct port of your ArrayFire code to R (using RcppArrayfire) and it doesn't output a distance matrix with 0 on the diagonal, even on small matrices (so I don't think it's a math approximation error). Perhaps there has been a change in how some functions work? – mkln Feb 10 '20 at 17:40
  • ^ found the error: the two tiled matrices must be summed not multiplied – mkln Feb 10 '20 at 19:51

They were contributed by NVIDIA, so they do perform well on CUDA-compatible cards. The real performance depends on the card itself and the function you are using.

In my experience, only cvRotate and cvResize performed better on the GPU than on a normal Intel CPU. (Note: I was only interested in image-related functions.)

Mohammad