I first wrote an OpenCL program on my MacBook Pro, and since my desktop computer is more powerful, I wanted to port the code and see whether it ran any faster.
The same code ran in:
Mac: 0.055452 s
Win7: 0.359 s
The specifications of the two computers are:
Mac: 2.6 GHz Intel Core i5, 8 GB 1600 MHz DDR3, Intel Iris 1536 MB
PC: 3.3 GHz Intel Core i5-2500K, 8 GB 1600 MHz DDR3, AMD Radeon HD 6900 Series
As you can see, the code ran about 6.5x faster on my Mac than on my desktop PC.
I timed the code using:
#include <ctime>
#include <iostream>
clock_t begin = clock();
.... // Entire main file
float timeTaken = (float)(clock() - begin) / CLOCKS_PER_SEC;
std::cout << "Time taken: " << timeTaken << std::endl;
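One thing worth checking about this timing code: clock() is specified to return processor time, and that is what it reports on macOS, but the Microsoft C runtime returns elapsed wall-clock time instead. The two numbers may therefore not measure the same thing. Below is a minimal sketch, assuming C++11 is available, that records both values with one helper. The names timeBoth and Timings are illustrative and are not part of the original code.

```cpp
#include <chrono>
#include <ctime>

// Illustrative result type: both figures are in seconds.
struct Timings {
    double wall;  // elapsed real time from a monotonic clock
    double cpu;   // whatever clock() reports on this platform
};

// Runs `work` once and measures it with steady_clock and with clock().
template <typename Work>
Timings timeBoth(Work&& work) {
    const std::clock_t cpuStart = std::clock();
    const auto wallStart = std::chrono::steady_clock::now();

    work();

    const std::chrono::duration<double> wall =
        std::chrono::steady_clock::now() - wallStart;
    const double cpu =
        static_cast<double>(std::clock() - cpuStart) / CLOCKS_PER_SEC;
    return {wall.count(), cpu};
}
```

Wrapping the body of main in a lambda and passing it to timeBoth gives a wall-clock figure that can be compared across both machines. If the wall and cpu values differ a lot on the Mac, part of the gap between the two platforms may come from the clock rather than from the code.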
If I am not mistaken, both the CPU and the GPU are stronger on the PC. I was able to run Battlefield 3 on Ultra settings with this desktop computer.
The only difference might be that Visual Studio on the PC uses a different compiler. I used g++ on my Mac; I'm not sure what Visual Studio uses.
These results don't make sense to me. What do you guys think? If you want to check out the code, I can post the GitHub link.
EDIT: The code is in this GitHub repo: https://github.com/Batkow/OpenCL/tree/master/OpenCL . PSO_V2 uses the coding style from this tutorial: https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/introduction-to-parallelization/
PSO simplifies the code using the custom headers from this GitHub repo: https://github.com/HandsOnOpenCL/Exercises-Solutions
I also ran the code on my friend's new i7 laptop with an NVIDIA GeForce 950M, and it ran even slower than on my desktop PC.
I realize the code isn't optimized, so please call out anything stupid I'm doing. For instance, having a while loop that runs across three different kernel functions is kind of stupid, right? I'm working on moving all of it into a single kernel that loops internally, which should improve performance?
UPDATE: I ran the OpenCL/PSO code on the Windows machine at home again. Timing only the while loop (start to end), Windows is faster, yay!
clock_t: Win7 0.027 s, Mac 0.036 s
Util::Timer class (from the external .hpp): Win7 0.026 s, Mac 0.085 s
Timing from the start of the main file to right before the while loop (all of the initialization), the Mac was almost 10 times faster than Windows with both clock_t and Util::Timer. So the bottleneck seems to be the initialization of the device?
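To narrow down which part of the initialization is slow, each setup step can be timed on its own. Below is a minimal sketch, assuming C++11. PhaseTimer is a hypothetical helper and is not the Util::Timer class from the HandsOnOpenCL headers.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: records how long each named phase took, in seconds.
class PhaseTimer {
    using Clock = std::chrono::steady_clock;

public:
    PhaseTimer() : last_(Clock::now()) {}

    // Stores the time elapsed since the previous mark (or since
    // construction) under `name`, then starts timing the next phase.
    double mark(const std::string& name) {
        const Clock::time_point now = Clock::now();
        const double seconds =
            std::chrono::duration<double>(now - last_).count();
        phases_.emplace_back(name, seconds);
        last_ = now;
        return seconds;
    }

    // All recorded phases, in the order they were marked.
    const std::vector<std::pair<std::string, double>>& phases() const {
        return phases_;
    }

private:
    Clock::time_point last_;
    std::vector<std::pair<std::string, double>> phases_;
};
```

A possible way to use it is to call mark() after each setup step, for example after clGetPlatformIDs/clGetDeviceIDs, after clCreateContext, and after clBuildProgram, then print phases(). Since clBuildProgram compiles the kernel source at run time, it is one candidate for a step whose cost varies a lot between drivers, but that is an assumption to verify and not a diagnosis. Kernel enqueues are asynchronous, so when timing the loop, call clFinish on the queue before mark().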