
I am creating a computer vision application that detects objects via a web camera. I am currently focusing on the performance of the application.

My problem is in the part of the application that generates the XML cascade file using Haar training. This is very slow and takes about 6 days. To get around this problem I decided to use multiprocessing, to minimize the total time needed to generate the Haar training XML file.

I found two solutions: OpenCL, and (OpenMP and OpenMPI).

Now I'm confused about which one to use. I read that OpenCL can use multiple CPUs and GPUs, but only on the same machine. Is that so? On the other hand, OpenMP is for multiprocessing, and with OpenMPI we can use multiple CPUs over the network, but OpenMP has no GPU support.

Can you please suggest the pros and cons of using each of these libraries?

user1235711

3 Answers


OpenCL is for using the GPU stream processors. http://en.wikipedia.org/wiki/Opencl

OpenMP is for using the CPU cores. http://en.wikipedia.org/wiki/Openmp

OpenMPI is for using a distributed network cluster. http://en.wikipedia.org/wiki/Openmpi

Which is best to use depends on your problem specification, but I would try OpenMP first because it is the easiest to port a single-threaded program onto. Sometimes you can just add a pragma telling it to parallelize a main loop, and you can get speedups on the order of the number of CPU cores.
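For illustration, here is a minimal sketch (my own, not from the answer) of that pragma-based approach; the loop body and workload sizes are made up:

```cpp
// A minimal sketch: parallelizing an independent per-item loop with one
// OpenMP pragma. Compile with e.g.:  g++ -fopenmp example.cpp
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int numImages = 1000;             // hypothetical workload size
    std::vector<double> score(numImages);

    // Each iteration is independent, so OpenMP can split the loop
    // across all available CPU cores.
    #pragma omp parallel for
    for (int i = 0; i < numImages; ++i) {
        double s = 0.0;
        for (int j = 0; j < 100000; ++j)    // stand-in for per-image work
            s += (i + j) % 7;
        score[i] = s;
    }

    std::printf("done, score[0] = %f (max threads: %d)\n",
                score[0], omp_get_max_threads());
    return 0;
}
```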

If your problem is very data-parallel and floating-point heavy, then you can get better performance out of the GPU, but you have to write a kernel in a C-like language and map or read/write memory buffers between the host and the GPU. It's a hassle, but performance gains in some cases can be on the order of 100x, as GPUs are specifically designed for data-parallel work.
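As a rough illustration of the kernel/host split and the buffer traffic described above, here is a minimal OpenCL 1.x sketch (my own; error checking omitted and all names hypothetical):

```cpp
// A minimal sketch of OpenCL: build a kernel at run time, copy a buffer to
// the device, run the kernel, read the result back.
// Link against OpenCL, e.g.:  g++ example.cpp -lOpenCL
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// The kernel is written in OpenCL C and compiled at run time.
static const char* kSource =
    "__kernel void scale(__global float* data, float factor) {\n"
    "    size_t i = get_global_id(0);\n"
    "    data[i] = data[i] * factor;\n"
    "}\n";

int main() {
    const size_t n = 1024;
    std::vector<float> host(n, 1.0f);

    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

    // Build the kernel from source.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", nullptr);

    // Copy the data to the device, run the kernel, read the result back.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), host.data(), nullptr);
    float factor = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(buf), &buf);
    clSetKernelArg(kernel, 1, sizeof(factor), &factor);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host.data(),
                        0, nullptr, nullptr);

    std::printf("host[0] = %f\n", host[0]);   // expected 2.0

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}
```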

OpenMPI will get you the most performance, but you need a cluster (a bunch of servers on the same network), and those are expensive.
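For a sense of what distributing work over a cluster with MPI looks like, here is a minimal sketch (my own illustration; the workload and numbers are hypothetical):

```cpp
// A minimal MPI sketch: each process handles a slice of the samples and the
// partial results are combined on rank 0.
// Build/run with e.g.:  mpicxx example.cpp && mpirun -np 4 ./a.out
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Hypothetical workload: 10000 training samples split across processes.
    const int totalSamples = 10000;
    long localSum = 0;
    for (int i = rank; i < totalSamples; i += size)
        localSum += i % 13;                  // stand-in for per-sample work

    long globalSum = 0;
    MPI_Reduce(&localSum, &globalSum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("combined result: %ld\n", globalSum);

    MPI_Finalize();
    return 0;
}
```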

Andrew Tomazos
  • You say OpenCL is for using the GPU. Does that mean OpenCL does not use the CPU at all? – user1235711 Apr 07 '12 at 18:27
  • 2
    No, OpenCL can use CPU too, but if you are using the CPU it is better to use OpenMP. – Andrew Tomazos Apr 07 '12 at 18:28
  • @AndrewTomazos-Fathomling, it is not necessarily so - OpenCL compilers for the CPU, like the one from Intel, for example, automatically vectorize the CPU code, which OpenMP does not. Also, since OpenCL programs can be compiled at run time, certain optimizations can be made by the compiler based on the run time context, which is not possible in a statically compiled program. See this for one perspective on this: http://stackoverflow.com/q/7126611/677131 – Lubo Antonov Apr 08 '12 at 11:21
  • @lucas1024: Speaking purely from experience, when I've tried to optimize something by using OpenMP (or plain old threads) and then optimize by translating it into a kernel and using OpenCL on the CPU Device (with both Intel Platform on Intel OCL SDK 1.5, and AMD OCL Platform for CPU), I have not noticed significant differences in performance between OpenMP and OpenCL on the CPU (only on GPU). As OpenMP is easier to use it wins. Hence my conclusion that OpenCL is only recommended for using GPU. There are a lot of things that could theoretically impact performance, the best thing is to test. – Andrew Tomazos Apr 08 '12 at 11:55
  • A 100x gain for the GPU? That's nonsense! That's comparing unoptimized single-threaded CPU code to optimized GPU code (and for double-precision floating point no consumer GPU is better per USD). Optimized single-precision floating-point code on the CPU will be within a factor of 5 of the GPU per USD. I recently optimized a [cholesky](http://stackoverflow.com/questions/22479258/cholesky-decomposition-with-openmp/23063655#23063655) decomposition solver with OpenMP and SIMD which is 40x faster than the Java BLAS library we were using. – Z boson Apr 24 '14 at 10:39
  • @Zboson: Sure, I did say "can be as much as 100x", as in an upper limit. Typically people characterize it with the ballpark 10x-100x. It's very hard to make a fair comparison because it depends on the data and algorithm, and in general we don't know how to produce an optimal program for a given machine, and even if we did we don't know how to prove it is optimal. For large matrix multiplication I have observed first-hand 15x difference between OpenCL/CUDA GPU and an optimized CPU BLAS. – Andrew Tomazos Apr 25 '14 at 08:02
  • GEMM is a good example. Compare peak flops for the CPU and the GPU. Then compare the cost of each. What was the 15x hardware? What was the efficiency of the GEMM on the CPU and GPU? I bet once you compare costs that 15x is going to be closer to 5x for single floating point and less than 2x for double. – Z boson Apr 25 '14 at 08:20
  • @Zboson: It was a 4096x4096 gemm. The CPU was Intel(R) Core(TM) i7-3930K CPU. The GPU was AMD Radeon HD 7970 3GB. I guess they cost fairly similar actually, about ~500 USD. – Andrew Tomazos Apr 25 '14 at 11:09
  • The 7970 has a peak SP GFLOPS of 4096 and 1024 DP GFLOPS; the 3930K is 256 SP GFLOPS and 128 DP GFLOPS. So from peak values the 7970 is 16x faster for SP and 8x faster for double precision. That probably explains the 15x you are seeing (SP?). But I think Haswell is a better comparison to the 7970. The 4770 is about $300 and doubles the peak values, so the 7970 is 8x better for SP and 4x for double. That's closer to the numbers I mentioned. Keep in mind that AMD has recently started following Nvidia and is crippling their DP performance on consumer cards, so DP will have no advantage anymore. – Z boson Apr 25 '14 at 11:33
  • @Zboson: It's extremely hard to get even vaguely close to peak flops out of hardware, so it's not just a matter of dividing the peaks. There are big differences in both the interface and internal architecture. The comparison is much harder. – Andrew Tomazos May 10 '14 at 13:35

Could the performance problem be in the XML file itself?

Have you tried to use a different, lighter file format?

I think that an XML file that takes 6 days to be generated must be quite long and complex. If you have control on this data format, try Google's Protocol Buffers.

Before digging into OpenMP, OpenCL or whatever, check how much time is spent accessing the hard disk; if that is the issue, the parallel libraries won't improve things.
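One way to do that check, as a rough sketch (my own illustration, not part of the answer), is to time the computation and the file writing separately, for example with std::chrono:

```cpp
// A minimal sketch: measure compute time and file-write time separately to
// see whether disk I/O is actually the bottleneck.
#include <chrono>
#include <cstdio>
#include <fstream>

int main() {
    using clock = std::chrono::steady_clock;

    auto t0 = clock::now();
    // ... computation goes here ...
    auto t1 = clock::now();

    std::ofstream out("cascade_test.xml");   // hypothetical output file
    for (int i = 0; i < 100000; ++i)
        out << "<node value=\"" << i << "\"/>\n";
    out.close();
    auto t2 = clock::now();

    auto ms = [](clock::time_point a, clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("compute: %lld ms, write: %lld ms\n",
                (long long)ms(t0, t1), (long long)ms(t1, t2));
    return 0;
}
```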

Pietro

Research OpenCV and see if it might help.

nikk