
I implemented an algorithm on Android using both OpenCL and OpenMP. Measured in native code, the OpenMP implementation runs about 10 times slower than the OpenCL one.

  • OpenMP: ~250 ms
  • OpenCL: ~25 ms

But overall, if I measure the time from the Java (Android) side, calling the function and getting my values back takes roughly the same time in both cases.

For example:

  • Java code:

    // calls the C implementation using JNI (Java Native Interface)
    boolean useOpenCL = true;
    myFunction(bitmap, useOpenCL);  // ~300 ms, timed with System.nanoTime() (timing code omitted for clarity)
    myFunction(bitmap, !useOpenCL); // ~300 ms, timed with System.nanoTime() (timing code omitted for clarity)
    
  • C code:

    JNIEXPORT void JNICALL Java_com_xxxxx_myFunctionNative(JNIEnv * env, jobject obj, jobject pBitmap, jboolean useOpenCL)
    {
    // same before, setting some variables
    
    clock_t startTimer, stopTimer;
    startTimer = clock();
    if ((bool) useOpenCL) {
       calculateUsingOpenCL(); // runs in ~25 ms, timed here, using clock()
    }
    else {
       calculateUsingOpenMP(); // runs in ~250 ms
    }
    stopTimer = clock();
    __android_log_print(ANDROID_LOG_VERBOSE, APPNAME, "Time in ms: %f\n", 1000.0f* (float)(stopTimer - startTimer) / (float)CLOCKS_PER_SEC);
    
    // same from here on, e.g.: copying values to java side
    }
    

The Java code executes in roughly the same time in both cases, around 300 ms. To be more precise, the elapsed time is slightly higher for OpenCL, that is, OpenCL is slower on average.

Looking at the individual run times of the OpenMP and OpenCL implementations, the OpenCL version should be much faster overall. But for some reason there is an overhead that I cannot find.

I also compared OpenCL against plain native code (no OpenMP) and got the same result: roughly the same overall run time, even though calculateUsingOpenCL ran at least 10 times faster.


Ideas:

  • Maybe the GPU (in the OpenCL case) is less efficient in general because it has less memory available. There are a few variables that we need to preallocate, which are used every frame. So we checked the time it takes Android to draw a bitmap in both cases (OpenMP, OpenCL). In the OpenCL case, drawing a bitmap sometimes took longer (3 times longer), but not by enough to equalize the overall run time of the program.

  • Does JNI use the GPU to accelerate some calls, which could cause the OpenCL version to be slower?

EDIT:

  • Is it possible that Java garbage collection is triggered by OpenCL, causing the big overhead?
Snowman
  • Which device/OS did you test on. Also which version of OpenMP library did you use (32 bit/64 bit)? – Morrison Chang Mar 27 '19 at 18:44
  • I tested on Open-Q 820 μSOM. OpenMP does not really matter here, because the question is about OpenCL being slow. – Snowman Mar 27 '19 at 18:55
  • "with OpenCL we need to allocate a few variables on it, which are used in every frame" - i'm not sure what you mean by "variables" but if you're allocating and releasing new buffers on every frame, you're almost certainly doing it wrong. My approach would be to have the buffers as static variables ("cl_mem" type), and add two methods "init" and "teardown" (self-explanatory). Then just use those buffers via static variables in every frame rendering. Same applies to OpenCL Command Queues - don't create & destroy them every frame; it can be slow. You want to reuse CL objects as much as possible. – mogu Mar 28 '19 at 07:58
  • No, we don't create variables every frame obviously. – Snowman Mar 28 '19 at 19:13

1 Answer


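It turns out that clock() is unreliable for this measurement (on Android). It reports processor time consumed by the process, not elapsed wall-clock time, so the time the host spends waiting for the GPU is most likely not counted. Instead, we used the following method to measure time. With this method, everything is OK.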

#include <stdint.h>
#include <time.h>

// Returns the current monotonic (wall-clock) time in nanoseconds.
int64_t getTimeNsec() {
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (int64_t) now.tv_sec*1000000000LL + now.tv_nsec;
}

// Use int64_t, not clock_t: clock_t may be too narrow to hold nanosecond values.
int64_t startTimer, stopTimer;
startTimer = getTimeNsec();
function_to_measure();
stopTimer = getTimeNsec();
__android_log_print(ANDROID_LOG_VERBOSE, APPNAME, "Runtime in milliseconds (ms): %f", (double)(stopTimer - startTimer) / 1000000.0);

This was suggested here: How to obtain computation time in NDK
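
To illustrate why the two clocks can disagree, here is a minimal sketch. It assumes that the host thread mostly blocks while the GPU works, which was not verified for the code in the question. The blocking is simulated with nanosleep(), and all names (wall_time_ns, cpu_time_ns, wait_like_gpu, time_blocking_wait, timing_t) are hypothetical helpers written for this example, not part of any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical result type: the same interval measured with both clocks. */
typedef struct {
    double wall_ms; /* elapsed time from CLOCK_MONOTONIC */
    double cpu_ms;  /* processor time from clock() */
} timing_t;

/* Wall-clock time in nanoseconds, same approach as getTimeNsec(). */
static int64_t wall_time_ns(void) {
    struct timespec now;
    int rc = clock_gettime(CLOCK_MONOTONIC, &now);
    assert(rc == 0); /* CLOCK_MONOTONIC is expected to be available */
    (void)rc;
    return (int64_t)now.tv_sec * 1000000000LL + now.tv_nsec;
}

/* Processor time used by this process so far, in nanoseconds. */
static int64_t cpu_time_ns(void) {
    return (int64_t)((double)clock() * 1e9 / (double)CLOCKS_PER_SEC);
}

/* Stand-in (assumption) for a host thread blocked on the GPU:
 * it sleeps for `ms` milliseconds and uses almost no CPU. */
static void wait_like_gpu(long ms) {
    struct timespec req;
    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&req, NULL);
}

/* Measures one blocking wait with both clocks. */
static timing_t time_blocking_wait(long ms) {
    timing_t t;
    int64_t wall_start = wall_time_ns();
    int64_t cpu_start = cpu_time_ns();

    wait_like_gpu(ms);

    t.cpu_ms = (double)(cpu_time_ns() - cpu_start) / 1e6;
    t.wall_ms = (double)(wall_time_ns() - wall_start) / 1e6;
    return t;
}
```

Calling time_blocking_wait(100) should report a wall_ms of about 100 and a much smaller cpu_ms, because a sleeping thread consumes almost no processor time. If the OpenCL path behaves similarly while waiting for the device, a clock()-based measurement would look much faster than the time seen from the Java side.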

Snowman