I've written a CUDA program for 2D convolution and now want to compare it against a non-CUDA implementation to measure the speedup.
I could compare against my own plain C implementation, which uses the classic nested-loop approach, or against MATLAB's conv2, but neither feels like a fair comparison, since they're not the fastest implementations out there.
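A minimal sketch of what such a nested-loop baseline might look like, for reference. The function name `conv2d_naive`, the row-major `float` layout, the zero-padded "same"-size output, and the odd kernel dimensions are all assumptions made for illustration, not details taken from the actual code being benchmarked.

```c
#include <assert.h>

/* Naive single-threaded 2D convolution (hypothetical baseline).
 * in, out: rows x cols images, row-major.
 * k:       krows x kcols kernel, row-major, odd dimensions assumed.
 * Output has the same size as the input; samples outside the image
 * are treated as zero. */
void conv2d_naive(const float *in, float *out, int rows, int cols,
                  const float *k, int krows, int kcols)
{
    assert(krows % 2 == 1 && kcols % 2 == 1);
    const int cr = krows / 2; /* kernel centre row */
    const int cc = kcols / 2; /* kernel centre column */

    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float acc = 0.0f;
            for (int i = 0; i < krows; ++i) {
                for (int j = 0; j < kcols; ++j) {
                    /* True convolution: the kernel is flipped, so
                     * kernel element (i, j) reads input (r+cr-i, c+cc-j). */
                    const int rr = r + cr - i;
                    const int cj = c + cc - j;
                    if (rr >= 0 && rr < rows && cj >= 0 && cj < cols)
                        acc += in[rr * cols + cj] * k[i * kcols + j];
                }
            }
            out[r * cols + c] = acc;
        }
    }
}
```

A baseline like this does no vectorisation, tiling, or threading by hand, which is why a speedup measured against it tends to look larger than one measured against a tuned CPU library.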
I was also thinking of trying OpenCV, and I've been looking for a SIMD-optimized version with no luck. Any advice? Should I go with OpenCV?
NOTE: I've read other questions, including this one, but the answers are basically either the same as my plain C code or a discussion of the various methods available.