This is my first time using OpenCL on ARM (CPU: Qualcomm Snapdragon MSM8930, GPU: Adreno(TM) 305).

OpenCL is very effective for the computation itself, but exchanging data between the CPU and GPU takes far more time than I expected.

Here is an example:

cv::Mat mat(640,480,CV_8UC3,cv::Scalar(0,0,0));
cv::ocl::oclMat mat_ocl;

//cpu->gpu
mat_ocl.upload(mat);
//gpu->cpu
mat = (cv::Mat)mat_ocl;

Even for a small image like this, the upload takes 10 ms and the download takes 20 ms. That seems far too long.

Can anyone tell me whether this is normal, or is something going wrong here?

Thank you in advance!

Added:

My measuring method is:

clock_t start,end;
start=clock();
mat_ocl.upload(mat);
end = clock();
__android_log_print(ANDROID_LOG_INFO,"tag","upload time = %f s",(double)(end-start)/CLOCKS_PER_SEC);

Actually, I'm not using OpenCL directly but the ocl module in OpenCV (although the documentation says they are equivalent). The OpenCV documentation only says to convert a cv::Mat to a cv::ocl::oclMat (which uploads the data from the CPU to the GPU) in order to compute on the GPU; I haven't found any memory-mapping method in the ocl module documentation.

rossi_lhf
  • How are you measuring the upload/download times? – unixsmurf Mar 21 '15 at 11:57
  • What's your hardware bandwidth for transfers? – huseyin tugrul buyukisik Mar 22 '15 at 10:13
  • I haven't found the hardware bandwidth figure yet; I can only say that the processor is a Qualcomm Snapdragon MSM8930 and the GPU is an Adreno(TM) 305. – rossi_lhf Mar 23 '15 at 02:24
  • rossi_lhf, what is your OS? `clock()` is usually the least precise option (some kernels are configured to round all reported times to 10 ms); try measuring time with `gettimeofday()` or `clock_gettime()` instead, see http://stackoverflow.com/questions/8594277/clock-precision-in-time-h. You can also measure the time of several copy operations (a loop of 10 or 50 uploads) – osgx Mar 23 '15 at 02:29
  • My OS is Android 4.2.2. I do in fact measure several copy operations, so 10 ms and 20 ms are average times. – rossi_lhf Mar 23 '15 at 02:52
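
The suggestion in the comments above (use `clock_gettime()` and average over a loop) could be sketched roughly as follows. The helper names `nowSeconds` and `averageSeconds` are hypothetical and not part of OpenCV or the NDK. Note that `clock()` reports CPU time consumed by the process, while `CLOCK_MONOTONIC` reports elapsed wall-clock time, which is usually what matters for a transfer.

```cpp
#include <cassert>
#include <time.h>

// Current monotonic time in seconds. CLOCK_MONOTONIC measures elapsed
// wall-clock time, whereas clock() reports CPU time used by the process.
static double nowSeconds()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

// Hypothetical helper: runs `op` `iterations` times and returns the
// average wall-clock time per call, in seconds.
template <typename Op>
double averageSeconds(Op op, int iterations)
{
    assert(iterations > 0);
    const double start = nowSeconds();
    for (int i = 0; i < iterations; ++i)
        op();
    return (nowSeconds() - start) / iterations;
}
```

With a C++11 compiler, the upload from the question could then be timed as `averageSeconds([&] { mat_ocl.upload(mat); }, 50)`. This still measures the whole OpenCV call rather than the raw transfer, and it assumes the call does not return before the copy has finished.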

2 Answers


Well, I found a useful explanation in the OpenCV documentation:

In a heterogeneous device environment, there may be cost associated with data transfer. This would be the case, for example, when data needs to be moved from host memory (accessible to the CPU), to device memory (accessible to a discrete GPU). in the case of integrated graphics chips, there may be performance issues, relating to memory coherency between access from the GPU “part” of the integrated device, or the CPU “part.” For best performance, in either case, it is recommended that you do not introduce data transfers between CPU and the discrete GPU, except in the beginning and the end of the algorithmic pipeline.

So this seems to explain why data transfer between the CPU and GPU is so slow. But I still don't know how to fix the issue.

rossi_lhf

Please provide your exact measuring method and results.

From my experience of OpenCL development on ARM platforms (not Qualcomm, though), you shouldn't expect much from read/write operations. The memory bus is typically only about 64 bits wide, and DDR3 isn't that fast.
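
To put the question's numbers in perspective, a short calculation helps. The sketch below converts an image size and a transfer time into effective throughput; the function names are hypothetical, and it assumes tightly packed rows with no padding. A 640x480 CV_8UC3 image is 921,600 bytes, so 10 ms per upload works out to roughly 92 MB/s and 20 ms per download to roughly 46 MB/s.

```cpp
#include <cassert>
#include <cstddef>

// Bytes in a tightly packed 8-bit image (assumption: no row padding).
std::size_t imageBytes(std::size_t rows, std::size_t cols, std::size_t channels)
{
    return rows * cols * channels;  // CV_8UC3: 3 channels, 1 byte each
}

// Effective throughput in MB/s, with 1 MB taken as 10^6 bytes.
double megabytesPerSecond(std::size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / seconds / 1e6;
}
```

For example, `megabytesPerSecond(imageBytes(640, 480, 3), 0.010)` gives the upload figure above. Comparing such a number with the platform's memory bandwidth is one way to judge whether the transfer itself or the surrounding overhead dominates.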

Use shared memory to your advantage: go for mapping/unmapping instead of read/write.

P.S. The actual operation time can be measured using cl_event profiling:

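#include <CL/cl.h>

// Returns the device-side duration of the command behind `event`, in nanoseconds.
// The command queue must have been created with CL_QUEUE_PROFILING_ENABLE;
// otherwise clGetEventProfilingInfo fails.
cl_ulong getTimeNanoSeconds(cl_event event)
{
    cl_ulong start = 0, end = 0;

    // Make sure the command has finished before querying its timestamps.
    cl_int ret = clWaitForEvents(1, &event);
    if (ret != CL_SUCCESS)
        throw(ret);

    // Timestamp at which the command started executing on the device.
    ret = clGetEventProfilingInfo(
              event,
              CL_PROFILING_COMMAND_START,
              sizeof(cl_ulong),
              &start,
              NULL);
    if (ret != CL_SUCCESS)
        throw(ret);

    // Timestamp at which the command finished executing on the device.
    ret = clGetEventProfilingInfo(
              event,
              CL_PROFILING_COMMAND_END,
              sizeof(cl_ulong),
              &end,
              NULL);
    if (ret != CL_SUCCESS)
        throw(ret);

    return (end - start);
}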
Roman Arzumanyan
  • My measuring method is: `clock_t start, end;` `start = clock();` `mat_ocl.upload(mat);` `end = clock();` `__android_log_print(ANDROID_LOG_INFO,"tag","upload time = %f s",(double)(end-start)/CLOCKS_PER_SEC);` – rossi_lhf Mar 23 '15 at 01:43
  • For 4 images of size 640x480 (CV_8UC3), the upload time is 0.0569 s and the download time is 0.1137 s – rossi_lhf Mar 23 '15 at 01:49
  • You are measuring the whole execution time of the OpenCV function, which wraps the OpenCL clEnqueueReadBuffer (clEnqueueReadImage) functions. The actual I/O time can be measured by having the aforementioned functions produce a cl_event and reading its profiling info – Roman Arzumanyan Mar 23 '15 at 07:30
  • My time measurement is indeed not accurate, but according to the OpenCV documentation (which I pasted in my own answer), I think the main problem may be the read/write speed on integrated chips. So I should try the shared-memory method. Thank you for your advice! – rossi_lhf Mar 23 '15 at 08:06
  • If you have enough time, I strongly advise you to dig deeper into the OpenCV classes and implement memory mapping/unmapping in place of read/write; that is what makes OpenCL development for mobile worthwhile. – Roman Arzumanyan Mar 23 '15 at 10:00