
I developed a naive function for mirroring an image horizontally or vertically using CUDA C++.
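For context, the core of such a naive mirror is just an index remapping. Here is a minimal CPU reference sketch of that arithmetic (my own naming and layout, not the original code; a naive CUDA kernel would assign one thread per output pixel and compute the same mapping):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference sketch: mirror a single-channel 8-bit image stored
// row-major in a flat vector. A naive CUDA kernel would compute the same
// (srcX, srcY) once per thread instead of once per loop iteration.
std::vector<unsigned char> mirrorImage(const std::vector<unsigned char>& src,
                                       std::size_t width, std::size_t height,
                                       bool horizontal) {
    std::vector<unsigned char> dst(src.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // Horizontal mirror flips columns; vertical mirror flips rows.
            std::size_t srcX = horizontal ? (width - 1 - x) : x;
            std::size_t srcY = horizontal ? y : (height - 1 - y);
            dst[y * width + x] = src[srcY * width + srcX];
        }
    }
    return dst;
}
```

Because each output pixel reads exactly one input pixel, the operation is memory-bound, which matters for the timing discussion below.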

Then I learned that the NVIDIA Performance Primitives (NPP) library also offers a function for image mirroring.

Just for the sake of comparison, I timed my function against NPP's. Surprisingly, my function outperformed it (only by a small margin, but still...).

I confirmed the results several times, using both the Windows high-resolution timer and CUDA event timers.
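For readers who want to reproduce such a comparison, a host-side wall-clock measurement can be sketched with `std::chrono` (this is only an illustrative pattern with my own naming; the CUDA-event analogue brackets the launch with `cudaEventRecord` calls instead):

```cpp
#include <chrono>

// Hedged sketch: time an arbitrary callable in milliseconds using a steady
// (monotonic) clock. Note: when timing CUDA kernels with a host timer, the
// device must be synchronized first (e.g. cudaDeviceSynchronize()), because
// kernel launches are asynchronous; CUDA events avoid that pitfall.
template <typename F>
double timeMs(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing each implementation several times and comparing the best (or median) run, as done here, reduces the influence of one-off launch and warm-up overheads.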

My question is: aren't NPP functions completely optimized for NVIDIA GPUs?

I'm using CUDA 5.0, GeForce GTX460M (Compute 2.1), and Windows 8 for development.

sgarizvi
  • What was the difference, in percent? The mirroring operations will be memory bound and newer devices are flexible in which types of memory access patterns they will handle efficiently. A naive implementation may be close to optimal on newer devices. Maybe the NPP version works better for older devices. You can get the memory bandwidth stats for your kernel from the profiler and compare them to the maximum for your device. – Roger Dahl Sep 14 '12 at 17:02
  • I tested on 4 types of images and 2 different sizes: 8-bit and 16-bit, 1-channel and 3-channel, at (1280 x 720) and (1920 x 1080). I got the maximum speedup on the 16-bit single-channel image of size (1280 x 720), which was 18.75 percent faster than NPP. – sgarizvi Sep 14 '12 at 18:06
  • 1
    You are right that NPP's performance is lacking. I've found better libraries out there for doing CUDA image processing. I personally like ArrayFire's image processing selection and have found it to be fast, http://www.accelereyes.com/arrayfire/c/group__image__mat.htm Other people have reported using GPU features of OpenCV, though I've not heard great things about that. Tunacode in Pakistan has some stuff too. – Ben Stewart Sep 14 '12 at 18:06
  • I thought that since NPP is made by NVIDIA itself, it should be the fastest. – sgarizvi Sep 14 '12 at 18:10
  • The same could be said of many software packages that come from hardware companies. – Ben Stewart Sep 14 '12 at 18:24
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/16708/discussion-between-ben-stewart-and-sgar91) – Ben Stewart Sep 14 '12 at 18:26

1 Answer


I risk getting no votes by posting this answer. :)

NVIDIA continuously works to improve all of our CUDA libraries. NPP is a particularly large library, with 4000+ functions to maintain. We have a realistic goal of providing libraries with a useful speedup over a CPU equivalent, that are tested on all of our GPUs and supported OSes, and that are actively improved and maintained. The function in question (Mirror) is a known performance issue that we will improve in a future release. If you need a particular function optimized, the best way to get it prioritized is to file an RFE (Request for Enhancement) bug using the bug submission form that is available to NVIDIA CUDA registered developers.

As an aside, I don't think any library can ever be "fully optimized". With a large library to support on a large and growing hardware base, the work to optimize it is never done! :)

We encourage folks to continue to try and outdo NVIDIA libraries, because overall it advances the state of the art and benefits the computing ecosystem.

harrism
  • As an aside... it is probably safe to say that, with enough time and effort, it's generally possible to beat library functions in terms of raw performance. Libraries typically make fewer assumptions so that they are more widely applicable. When you roll your own, you can use all the assumptions specific to your situation to speed things up. One example that comes to mind (not GPGPU, but the same idea likely applies) is sorting: it isn't hard to beat standard sorting methods if you know a lot about your data and are willing to bake those assumptions into the code. – Patrick87 Feb 11 '13 at 17:25