I have parallelized an already existing computer vision code using OpenMP. I think I designed the parallelization well because:
- The workload is well-balanced
- There is no synchronization/locking mechanism
- I parallelized the outermost loops (see the sketch after this list)
- All the cores are busy most of the time (there are no idle cores)
- There is enough work for each thread
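To make the structure concrete, here is a simplified sketch of the kind of loop I mean (the function, the kernel, and the sizes are placeholders, not my actual code): the outermost loop is parallelized, each thread writes only to its own rows, so no locking is needed.

```cpp
#include <omp.h>
#include <vector>

// Placeholder per-pixel kernel: the real work is a computer vision operation.
void process(const std::vector<float>& in, std::vector<float>& out,
             int rows, int cols)
{
    #pragma omp parallel for schedule(static)  // static: iterations cost roughly the same
    for (int r = 0; r < rows; ++r) {           // outermost loop -> enough work per thread
        for (int c = 0; c < cols; ++c) {
            out[r * cols + c] = in[r * cols + c] * 2.0f;  // stand-in for the real per-pixel work
        }
    }
}
```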
However, the application doesn't scale well when using many cores: for example, the speedup flattens out beyond roughly 15 cores.
The code uses external libraries (OpenCV and IPP) whose routines are already optimized and vectorized, and I manually vectorized some portions of the remaining code as best as I could. However, according to Intel Advisor the code still isn't well vectorized, and there isn't much left to do: I already vectorized everything I could, and I can't improve the external libraries.
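For reference, the manual vectorization I did looks roughly like this (again a placeholder kernel, not my real code): the loop is marked with `omp simd` so the compiler emits packed instructions where it can.

```cpp
#include <omp.h>

// Placeholder example of a manually vectorized loop (the real kernels differ).
void saxpy(float a, const float* x, float* y, int n)
{
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```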
So my question is: is it possible that vectorization (or the lack of it) is the reason the code stops scaling at some point? If so, why?