A few weeks ago I attended an ArrayFire webinar hosted by NVIDIA, in which the engineers presented some interesting graphs comparing the performance of the ArrayFire library with OpenCV on the CPU (1 thread) and on the GPU (CUDA).
HARRIS Keypoint detection
ORB Keypoint detection
I had the opportunity to ask them why the ArrayFire speedup (over the single-threaded CPU implementation) decreases for large images. They answered that "it was due to the fact OpenCV CPU was processing large scale data very efficiently", without giving any technical details.
Do you have any idea what those technical details might be?