19

I'm new to CUDA programming and I was wondering how the performance of pyCUDA compares to programs implemented in plain C. Will the performance be roughly the same? Are there any bottlenecks that I should be aware of?

EDIT: I obviously tried to google this issue first, and was surprised to not find any information. i.e. I would have expected that the pyCUDA people have this question answered in their FAQ.

memyself
  • 11,907
  • 14
  • 61
  • 102
  • Is finding benchmarks, or writing your own, to do a comparison out of the question? Seems like there should be some pyCUDA programs with corresponding CUDA-C programs; why not run them, time them, and see for yourself? Also, I'd find it surprisingly negligent on the part of the pyCUDA folks not to discuss the performance differences at least to some level of detail... CUDA is used for performance, and scripting languages are slower than full-fledged programming languages. Even if Python is popular among scientific computing people, you can't ignore the 900lb gorilla of performance. – Patrick87 Oct 28 '11 at 15:56
  • 4
    before I start writing my own benchmarks I would really love to hear what people with experience have to say. Like I said before, I'm new to CUDA so I don't really have a good feeling yet for what's possible and what to look out for. – memyself Oct 28 '11 at 16:38

5 Answers

19

If you're using CUDA -- whether directly through C or with pyCUDA -- all the heavy numerical work you're doing is done in kernels that execute on the GPU and are written in CUDA C (directly by you, or indirectly with elementwise kernels). So there should be no real difference in performance in those parts of your code.

Now, the initialization of arrays, and any post-work analysis, will be done in Python (probably with numpy) if you use pyCUDA, and that generally will be significantly slower than doing it directly in a compiled language (though if you've built your numpy/scipy in such a way that it links directly to high-performance libraries, then those calls at least would perform the same in either language). But hopefully, your initialization and finalization are small fractions of the total amount of work you have to do, so even if there is significant overhead there, it won't have a huge impact on overall runtime.

And in fact, if it turns out that the Python parts of the computation do hurt your application's performance, starting out in pyCUDA may still be an excellent way to get started, as the development is significantly easier; you can always re-implement the parts of the code that are too slow in Python in straight C, and call those from Python, gaining some of the best of both worlds.

Jonathan Dursi
  • 50,107
  • 9
  • 127
  • 158
  • What about the cost of launching kernels? Is that expected to be the same in PyCUDA as in pure C? – user2398029 Nov 29 '12 at 00:23
  • Presumably the kernel launch is slower, as a line of python has to be interpreted/executed. – Jonathan Dursi Nov 29 '12 at 12:47
  • 2
    Yup yup. In fact it's up to an order of magnitude slower according to my benchmarks. Big performance hit for iterative algorithms where the actual GPU processing is very fast. – user2398029 Nov 29 '12 at 22:23
  • ".. So there should be no real difference in performance in those parts of your code." --- makes no sense. Or maybe all of the computation that is done on a CPU should be the same speed just because it's done on a CPU? I haven't seen any talk about compiler optimization in this discussion. And I have yet to find anywhere any mention of PyCUDA using O3, or O3 + FastMath optimizations, or having the possibility of using them. As far as I've tried the example code, it compiled too darn fast for me to believe that it used nvcc with maxed optimizations. And they matter like A LOT. – Íhor Mé Aug 25 '16 at 22:43
  • 1
    @ÍhorMé : You can set the compiler options using the options discussed in, e.g., https://documen.tician.de/pycuda/driver.html#module-pycuda.compiler . – Jonathan Dursi Aug 26 '16 at 19:37
6

If you're wondering about performance differences by using pyCUDA in different ways, see SimpleSpeedTest.py included in the pyCUDA Wiki examples. It benchmarks the same task completed by a CUDA C kernel encapsulated in pyCUDA, and by several abstractions created by pyCUDA's designer. There's a performance difference.

Peter Becich
  • 989
  • 3
  • 14
  • 30
4

I've been using pyCUDA for a little while and I like prototyping with it because it speeds up the process of turning an idea into working code.

With pyCUDA you will be writing the CUDA kernels in CUDA C/C++, and since it's the same CUDA, there shouldn't be a difference in the performance of that code. But there will be a difference between the performance of the Python code you write to set up the kernel or use its results and that of the equivalent code written in C.

jkysam
  • 5,533
  • 1
  • 21
  • 16
2

I was looking for an answer to the original question in this post, and I see the problem is deeper than I thought.

In my experience, I compared CUDA kernels and cuFFT calls written in C with the same ones written in PyCUDA. Surprisingly, I found that, on my computer, the performance of summing, multiplying, or computing FFTs varies between implementations. For example, I got almost the same cuFFT performance for vector sizes up to 2^23 elements. However, summing and multiplying complex vectors showed some trouble: the speedup obtained in C/CUDA was ~6x for N = 2^17, whereas in PyCUDA it was only ~3x. It also depends on how the summation is performed. Using SourceModule and wrapping the raw CUDA code, I found that my kernel for complex128 vectors was limited to a smaller N (<= 2^16) than the one usable with gpuarray (<= 2^24).

In conclusion, it is worth testing and comparing both sides of the problem, and evaluating whether it is worthwhile to spend time writing CUDA C code or to gain readability and pay the cost of lower performance.

1

If you're using PyCUDA and you want high performance, make sure you pass -O3-style optimization flags to nvcc and use nvprof/nvvp to profile your kernels. If you want to use CUDA from Python, PyCUDA is probably THE choice, because interfacing C++/CUDA code from Python is just hell otherwise: you have to write a lot of ugly wrappers, and for numpy integration even more hardcore wrapper code would be necessary.

Íhor Mé
  • 896
  • 9
  • 13