
I've been writing an OpenCL program that I first developed on my MacBook Pro, and since my desktop computer is stronger I wanted to port the code and see if there is any improvement.

The same code ran for:

Mac: 0.055452s

Win7: 0.359s

The specifications of both computers are:

Mac: 2.6GHz Intel Core i5, 8GB 1600MHz DDR3, Intel Iris 1536MB

PC: 3.3GHz Intel Core i5-2500k, 8GB 1600MHz DDR3, AMD Radeon HD 6900 Series

Now, as you can see, the code ran about 6-7x faster on my Mac than on my desktop PC.

I timed the code using

#include <ctime>
#include <iostream>

clock_t begin = clock();
// ... entire main file ...
float timeTaken = (float)(clock() - begin) / CLOCKS_PER_SEC;
std::cout << "Time taken: " << timeTaken << std::endl;

If I am not mistaken, both the CPU and the GPU are stronger on the PC. I was able to run Battlefield 3 on Ultra settings with this desktop computer.

The only difference might be that Visual Studio on the PC compiles with a different compiler? I used g++ on my Mac; I'm not sure what Visual Studio uses.

These results don't make sense to me. What do you guys think? If you want to check out the code I can post the GitHub link.

EDIT: The following GitHub link shows the code: https://github.com/Batkow/OpenCL/tree/master/OpenCL . PSO_V2 uses the style of coding from the tutorial at https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/introduction-to-parallelization/

PSO simplifies the coding using the custom headers from this GitHub repo: https://github.com/HandsOnOpenCL/Exercises-Solutions

I ran the code on my friend's new i7 laptop with an NVIDIA GeForce 950M, and it executed even slower than on my desktop PC.

I do realize that the code isn't optimized, so if I'm doing anything stupid, please call it out. For instance, having a while loop that launches three different kernel functions each iteration is kind of stupid, right? I'm working on implementing all of it inside a single kernel and looping inside it, which should improve performance?

UPDATE: I ran the OpenCL/PSO code on the Windows machine at home again. Timing just the while loop (starting right before it and stopping right after it) gives Windows the faster performance, yay!

With clock_t: Win7 = 0.027 s and Mac = 0.036 s. Using the external .hpp with the Util::Timer class: Win7 ran in 0.026 s while the Mac took 0.085 s.

Timing from the start of the main file to right before the while loop (i.e., all of the initialization), the Mac scored better than Windows by almost 10 times, using both clock_t and the Util::Timer. So the bottleneck on Windows seems to be the initialization of the device?
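One way to confirm where the time goes (just a rough sketch, assuming std::chrono is available; the commented-out phases are placeholders for whatever the actual setup and PSO loop do) is to time each phase with a wall-clock timer:

#include <chrono>
#include <iostream>

// Hypothetical helper: milliseconds elapsed since 'start'.
static double msSince(std::chrono::steady_clock::time_point start) {
  return std::chrono::duration<double, std::milli>(
             std::chrono::steady_clock::now() - start).count();
}

int main() {
  auto t = std::chrono::steady_clock::now();
  // ... platform/context/queue creation, program build, buffer setup ...
  std::cout << "initialization: " << msSince(t) << " ms" << std::endl;

  t = std::chrono::steady_clock::now();
  // ... the PSO while loop that enqueues the kernels ...
  std::cout << "PSO loop:       " << msSince(t) << " ms" << std::endl;
}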

Batko
  • try to read this: http://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp – Incomputable Feb 07 '16 at 18:08
    Please post the link to the code. It's not clear exactly what you're timing from that code snippet - are you also timing device initialisation, runtime compilation, device memory allocation/transfer etc? These will add large overheads to your timings, which will vary drastically between platforms. – jprice Feb 07 '16 at 18:38
  • @jprice github link is up now! – Batko Feb 08 '16 at 17:19

6 Answers


It could be dozens of things; what the CL kernel does would be key, and how well that works on different types of GPUs. Or what compiler is used.

However, I think the problem is how you are measuring time. clock() on Windows measures "wall-clock time" (in other words, "elapsed time"); on OS X (and all other sane OSes) it reports CPU time for your process. If the work runs on the graphics processor [or in a separate process], it won't be counted as CPU time on OS X, whereas Windows measures the overall elapsed time.

Either measure CPU time on Windows using an appropriate API (GetProcessTimes, for example), or use C++ std::chrono to measure wall-clock time in both places.
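A minimal sketch of measuring both at once, so the two numbers can be compared directly on either OS (std::clock for CPU time, std::chrono for wall-clock time; the OpenCL work itself is elided):

#include <chrono>
#include <ctime>
#include <iostream>

int main() {
  std::clock_t c0 = std::clock();              // CPU time
  auto w0 = std::chrono::steady_clock::now();  // wall-clock time

  // ... run the OpenCL host code and kernels here ...

  double cpu  = double(std::clock() - c0) / CLOCKS_PER_SEC;
  double wall = std::chrono::duration<double>(
                    std::chrono::steady_clock::now() - w0).count();
  std::cout << "CPU time:  " << cpu  << " s\n"
            << "Wall time: " << wall << " s" << std::endl;
}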

Mats Petersson

Maybe the problem is with the way you measure elapsed time. For example, I use three different methods for different OSes in my project:

#include <cstdio>
#include <iostream>

#if defined(_WIN32) || defined(_WIN64)
#include <windows.h>
#elif defined(__linux__)
#include <time.h>
#elif defined(__APPLE__)
#include <mach/mach.h>
#include <mach/mach_time.h>
#include <stddef.h>
#endif
#if defined(_WIN32) || defined(_WIN64)
LARGE_INTEGER frequency;  // ticks per second
LARGE_INTEGER t0, t1, t2; // ticks
LARGE_INTEGER t3, t4;
#elif defined(__linux__)
timespec t0, t1, t2;
timespec t3, t4;
double us;
#elif defined(__APPLE__)
unsigned long t0, t1, t2;
#endif
double elapsedTime;
void refreshTime() {
#if defined(_WIN32) || defined(_WIN64)
  QueryPerformanceFrequency(&frequency); // get ticks per second
  QueryPerformanceCounter(&t1);          // start timer
  t0 = t1;
#elif defined(__linux__)
  clock_gettime(CLOCK_MONOTONIC_RAW, &t1);
  t0 = t1;
#elif defined(__APPLE__)
  t1 = mach_absolute_time();
  t0 = t1;
#endif
}

void watch_report(const char *str) {
#if defined(_WIN32) || defined(_WIN64)
  QueryPerformanceCounter(&t2);
  printf(str, (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart);
  t1 = t2;
  elapsedTime = (t2.QuadPart - t0.QuadPart) * 1000.0 / frequency.QuadPart;
#elif defined(__linux__)
  clock_gettime(CLOCK_MONOTONIC_RAW, &t2);
  time_t sec = t2.tv_sec - t1.tv_sec;
  long nsec;
  if (t2.tv_nsec >= t1.tv_nsec) {
    nsec = t2.tv_nsec - t1.tv_nsec;
  } else {
    nsec = 1000000000 - (t1.tv_nsec - t2.tv_nsec);
    sec -= 1;
  }
  printf(str, (float)sec * 1000.f + (float)nsec / 1000000.f);
  t1 = t2;
  elapsedTime = (float)(t2.tv_sec - t0.tv_sec) * 1000.f +
                (float)(t2.tv_nsec - t0.tv_nsec) / 1000000.f;
#elif defined(__APPLE__)
  uint64_t elapsedNano;
  static mach_timebase_info_data_t sTimebaseInfo;

  if (sTimebaseInfo.denom == 0) {
    (void)mach_timebase_info(&sTimebaseInfo);
  }

  t2 = mach_absolute_time();
  elapsedNano = (t2 - t1) * sTimebaseInfo.numer / sTimebaseInfo.denom;
  printf(str, (float)elapsedNano / 1000000.f);
  t1 = t2;
  elapsedNano = (t2 - t0) * sTimebaseInfo.numer / sTimebaseInfo.denom;
  elapsedTime = (float)elapsedNano / 1000000.f;
#endif
}
/*This Function will work till you press q*/
void someFunction() {
  while (1) {
    char ch = std::cin.get();
    if (ch == 'q')
      break;
  }
}

int main() {
  refreshTime();
  someFunction();
  watch_report("some function was working: \t%9.3f ms\n");
}
segevara
  • Didn't get your code to work, something was missing.. but I used the code from one of the answers here: http://stackoverflow.com/questions/17432502/how-can-i-measure-cpu-time-and-wall-clock-time-on-both-linux-windows ... Mac got a wall time of 0.085 and CPU time of 0.051, while the PC got a wall time of 0.354 and CPU time of 0.203s – Batko Feb 07 '16 at 21:15
  • @Batko my fault, I've just fixed the code, adding the missing headers and a someFunction for testing; it seems to work for me on Linux. I can say that I have run this function from my project on Mac, Windows and Linux and got more or less comparable results. – segevara Feb 08 '16 at 06:05

It's because the clock() function just counts CPU clock cycles. Your host code could be invoking the kernel and then going to sleep, and this sleep time won't be counted by the clock function even if your kernel is executing. What this means is that the clock function considers only the execution time of the host code and not the OpenCL kernel. You need to use a function that counts wall-clock time rather than CPU clock cycles.

Johns Paul

Visual Studio will use the MSVC compiler unless another one is specified explicitly.

I think the answer is hidden in your CPUs' generations. On your PC it is Sandy Bridge (2nd gen), and on the Mac it is Haswell (4th gen); that is a two-generation difference.

OpenCL is one of the things that evolved significantly across these generations of Intel processors (with much more extensive hardware support in Haswell).

Just to get proof: find a friend with a desktop equipped with a Haswell CPU and run your tests. A desktop Haswell processor should beat your Mac's (provided, of course, that the other hardware specs and the overall system load match).

nickolay
  • is there any way to confirm this from my end? – Batko Feb 07 '16 at 18:26
  • Just tried the benchmark. Win7 gave a score of 4307 for CPU+GPU and 3178 for GPU only, whereas the Mac gave 1602 for CPU+GPU and 1660 for GPU only. Hm, so hardware doesn't seem to be the issue? I noticed that LuxMark said the platform version is 1.2 for both CPU and GPU on the MacBook, while Win7 got OpenCL 2.0 AMD-APP (1800.8) – Batko Feb 07 '16 at 21:03
  • @Batko I would not be 100% sure that these numbers settle everything, as the benchmark uses a combination of different test types and ultimately gives us synthetic values. Could you give a link to the code? – nickolay Feb 08 '16 at 01:26
  • Hi! I edited the post with the github links so they're up now : https://github.com/Batkow/OpenCL/tree/master/OpenCL – Batko Feb 08 '16 at 17:14

You've posted your timing functions but not the OpenCL code. What exactly is being timed? A big factor might be the kernel compilation time (clCreateProgramWithSource and clBuildProgram). Your PC might be using cached kernels while your Mac isn't. The proper way to measure OpenCL kernel execution time is to use OpenCL events.
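For example, a rough sketch of event-based timing (assuming the context, device, kernel and work size already exist from the surrounding host code, and with error checking omitted):

cl_int err;
// The queue must be created with profiling enabled.
cl_command_queue queue =
    clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

size_t global = 1024; // example work size
cl_event evt;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong start = 0, end = 0; // nanoseconds on the device timeline
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("kernel time: %f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);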

Dithermaster
  • As you mentioned, the building of the program seems to be what takes most of the time for me on the Windows platform. But writing the buffers is a bit slower as well. Other than that, it seems to have the same run time. I also noticed that the AMD SDK says it has an OpenCL 2.0 platform.. however my GPU isn't compatible with 2.0.. And I can't find any AMD SDK for OpenCL 1.2 online either – Batko Feb 08 '16 at 19:30
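Following up on the build time mentioned in the comment above, one common workaround (a rough sketch only, assuming a single device, that program/context/device come from the existing setup, that <vector> is included, and that error handling and file I/O are omitted) is to retrieve the program binary after the first clBuildProgram and reuse it on later runs with clCreateProgramWithBinary:

// First run: after a successful clBuildProgram, grab the device binary.
size_t binSize = 0;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);
std::vector<unsigned char> binary(binSize);
unsigned char *binPtr = binary.data();
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binPtr), &binPtr, NULL);
// ... write 'binary' to a cache file ...

// Later runs: rebuild from the cached binary instead of the source.
const unsigned char *p = binary.data();
cl_int binStatus, err;
cl_program cached = clCreateProgramWithBinary(context, 1, &device,
                                              &binSize, &p, &binStatus, &err);
clBuildProgram(cached, 1, &device, NULL, NULL, NULL);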

Another possible reason why you might be getting those results: maybe your program compiles against OpenCL 2.0.

On Windows, your GPU is from 2010 and only supports OpenCL 1.2.

On OSX, your Intel GPU supports OpenCL 2.0.
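A quick way to check what each machine actually reports (a minimal sketch using the standard Khronos C API; on OS X the header is <OpenCL/opencl.h> instead of <CL/cl.h>):

#include <CL/cl.h>
#include <cstdio>

int main() {
  cl_platform_id platforms[8];
  cl_uint numPlatforms = 0;
  clGetPlatformIDs(8, platforms, &numPlatforms);

  for (cl_uint p = 0; p < numPlatforms; ++p) {
    char version[256], name[256];
    clGetPlatformInfo(platforms[p], CL_PLATFORM_VERSION, sizeof(version), version, NULL);
    printf("Platform: %s\n", version);

    cl_device_id devices[8];
    cl_uint numDevices = 0;
    clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &numDevices);
    for (cl_uint d = 0; d < numDevices; ++d) {
      clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
      clGetDeviceInfo(devices[d], CL_DEVICE_VERSION, sizeof(version), version, NULL);
      printf("  %s : %s\n", name, version);
    }
  }
  return 0;
}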

Soonts
  • I did a benchmark on the Mac and Windows, as you can see in another comment, and Windows got a much better score. It said that the OpenCL version on Win7 was 2.0 AMD-APP, but on the Mac only OpenCL 1.2 – Batko Feb 07 '16 at 21:06
  • Your AMD GPU only supports 1.2: https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_6xxx_Series If your program uses OpenCL 2.0 features, like shared virtual memory/nested parallelism/atomics/pipes/etc, OpenCL will happily run your program on Intel GPU on Mac, but won’t be able to on your AMD GPU on Windows. – Soonts Feb 07 '16 at 22:11
  • I did a test run of example01 from https://github.com/HandsOnOpenCL/Exercises-Solutions/tree/master/Solutions/Exercise05/Cpp which prints out the device info.. And my CPU and GPU on the Mac support only version 1.2... – Batko Feb 08 '16 at 19:34