4

I am trying to measure the execution time of GPU and compare it with CPU. I wrote a simple_add function to add all elements of a short int vector. The Kernel code is:

global const int * A, global const uint * B, global int* C)
    {
        ///------------------------------------------------
        /// Add 16 bits of each
        int AA=A[get_global_id(0)];
        int BB=B[get_global_id(0)];
        int AH=0xFFFF0000 & AA;
        int AL=0x0000FFFF & AA;
        int BH=0xFFFF0000 & BB;
        int BL=0x0000FFFF & BB;
        int CL=(AL+BL)&0x0000FFFF;
        int CH=(AH+BH)&0xFFFF0000;      
        C[get_global_id(0)]=CH|CL;               
     }

I wrote another CPU version of this function and, after 100 executions, measured their execution times:

// Benchmark loop: 100 kernel launches vs. 100 CPU calls.
// NOTE(review): clock() measures CPU time consumed by the host process, not
// wall-clock time — while the GPU runs, the host blocks in finish(), so this
// number reflects host-side busy-waiting/API overhead rather than GPU work.
// Prefer a wall-clock timer (std::chrono::steady_clock) or OpenCL event
// profiling for the GPU side.
clock_t before_GPU = clock();
for(int i=0;i<100;i++)
{
  queue.enqueueNDRangeKernel(kernel_add,1,
  cl::NDRange((size_t)(NumberOfAllElements/4)),cl::NDRange(64));
  queue.finish(); // NOTE(review): finishing inside the loop adds per-iteration sync overhead; for an in-order queue one finish() after the loop suffices
 }
 clock_t after_GPU = clock();


 // CPU reference timing over the same number of iterations.
 clock_t before_CPU = clock();
 for(int i=0;i<100;i++)
     AddImagesCPU(A,B,C);
  clock_t after_CPU = clock();

the result was as below after 10 times calling the whole measurement function:

        CPU time: 1359
        GPU time: 1372
        ----------------
        CPU time: 1336
        GPU time: 1269
        ----------------
        CPU time: 1436
        GPU time: 1255
        ----------------
        CPU time: 1304
        GPU time: 1266
        ----------------
        CPU time: 1305
        GPU time: 1252
        ----------------
        CPU time: 1313
        GPU time: 1255
        ----------------
        CPU time: 1313
        GPU time: 1253
        ----------------
        CPU time: 1384
        GPU time: 1254
        ----------------
        CPU time: 1300
        GPU time: 1254
        ----------------
        CPU time: 1322
        GPU time: 1254
        ----------------

The problem is that I really expected the GPU to be much faster than the CPU, but it was not. I can't understand why my GPU's speed is not much higher than my CPU's. Is there any problem in my code? Here are my GPU properties:

        -----------------------------------------------------
        ------------- Selected Platform Properties-------------:
        NAME:   AMD Accelerated Parallel Processing
        EXTENSION:      cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing
        VENDOR:         Advanced Micro Devices, Inc.
        VERSION:        OpenCL 1.2 AMD-APP (937.2)
        PROFILE:        FULL_PROFILE
        -----------------------------------------------------
        ------------- Selected Device Properties-------------:
        NAME :  ATI RV730
        TYPE :  4
        VENDOR :        Advanced Micro Devices, Inc.
        PROFILE :       FULL_PROFILE
        VERSION :       OpenCL 1.0 AMD-APP (937.2)
        EXTENSIONS :    cl_khr_gl_sharing cl_amd_device_attribute_query cl_khr_d3d10_sharing
        MAX_COMPUTE_UNITS :     8
        MAX_WORK_GROUP_SIZE :   128
        OPENCL_C_VERSION :      OpenCL C 1.0
        DRIVER_VERSION:         CAL 1.4.1734
        ==========================================================

and just to compare this is my CPU specifications:

        ------------- CPU Properties-------------:
        NAME :          Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz
        TYPE :  2
        VENDOR :        GenuineIntel
        PROFILE :       FULL_PROFILE
        VERSION :       OpenCL 1.2 AMD-APP (937.2)
        MAX_COMPUTE_UNITS :     4
        MAX_WORK_GROUP_SIZE :   1024
        OPENCL_C_VERSION :      OpenCL C 1.2
        DRIVER_VERSION:         2.0 (sse2,avx)
        ==========================================================

I also measured the wall clock time using QueryPerformanceCounter and here is the results:

            CPU time: 1304449.6  micro-sec
            GPU time: 1401740.82  micro-sec
            ----------------------
            CPU time: 1620076.55  micro-sec
            GPU time: 1310317.64  micro-sec
            ----------------------
            CPU time: 1468520.44  micro-sec
            GPU time: 1317153.63  micro-sec
            ----------------------
            CPU time: 1304367.29  micro-sec
            GPU time: 1251865.14  micro-sec
            ----------------------
            CPU time: 1301589.17  micro-sec
            GPU time: 1252889.4  micro-sec
            ----------------------
            CPU time: 1294750.21  micro-sec
            GPU time: 1257017.41  micro-sec
            ----------------------
            CPU time: 1297506.93  micro-sec
            GPU time: 1252768.9  micro-sec
            ----------------------
            CPU time: 1293511.29  micro-sec
            GPU time: 1252019.88  micro-sec
            ----------------------
            CPU time: 1320753.54  micro-sec
            GPU time: 1248895.73  micro-sec
            ----------------------
            CPU time: 1296486.95  micro-sec
            GPU time: 1255207.91  micro-sec
            ----------------------

Again I tried the opencl profiling for execution time.

            // Time the kernel with OpenCL event profiling: START/END timestamps
            // are recorded by the device itself, so this excludes host-side API
            // and queuing overhead (queue must be created with
            // CL_QUEUE_PROFILING_ENABLE for getProfilingInfo to be valid).
            queue.enqueueNDRangeKernel(kernel_add,1,
                                    cl::NDRange((size_t)(NumberOfAllElements/4)),
                                    cl::NDRange(64),NULL,&ev);
            ev.wait();
            queue.finish(); // redundant after ev.wait() for a single kernel, but harmless
            time_start=ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
            time_end=ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();

Results for one time execution were more or less the same:

            CPU time: 13335.1815  micro-sec
            GPU time: 11865.111  micro-sec
            ----------------------
            CPU time: 13884.0235  micro-sec
            GPU time: 11663.889  micro-sec
            ----------------------
            CPU time: 19724.7296  micro-sec
            GPU time: 14548.222  micro-sec
            ----------------------
            CPU time: 19945.3199  micro-sec
            GPU time: 15331.111  micro-sec
            ----------------------
            CPU time: 17973.5055  micro-sec
            GPU time: 11641.444  micro-sec
            ----------------------
            CPU time: 12652.6683  micro-sec
            GPU time: 11632  micro-sec
            ----------------------
            CPU time: 18875.292  micro-sec
            GPU time: 14783.111  micro-sec
            ----------------------
            CPU time: 32782.033  micro-sec
            GPU time: 11650.444  micro-sec
            ----------------------
            CPU time: 20462.2257  micro-sec
            GPU time: 11647.778  micro-sec
            ----------------------
            CPU time: 14529.6618  micro-sec
            GPU time: 11860.112  micro-sec
Afshin
  • 392
  • 2
  • 11
  • `clock()` measures CPU time instead of wall clock time. It won't count in GPU running time. The time you measures is probably taken by OpenCL API calls. Try `clock_gettime()` in C or `std::chrono::steady_clock` in C++ instead. You didn't mention the unit of your "CPU time". If it's raw output of `clock()` function (which must be divided by `CLOCKS_PER_SEC` to get the number of seconds), 1200 is really a short period. – cuihao Feb 21 '17 at 06:46
  • See [this answer](http://stackoverflow.com/a/29972269/1231073) for OpenCL kernel time measurement. – sgarizvi Feb 21 '17 at 07:38
  • Since I was comparing two execution times, I think it does not matter to use CPU time or wall-clock time. However, I tried to measure wall clock time in microseconds and adding this measurement. – Afshin Feb 21 '17 at 07:42
  • Try to use memory mapping for buffers, don't copy. Also its an old device – huseyin tugrul buyukisik Feb 21 '17 at 08:21
  • @ sgarizvi !! ... I tried the opencl profiling results were almost the same. – Afshin Feb 21 '17 at 08:23
  • @ huseyin .. I did not consider the copy time in the measurement. I guess that after the copy, the GPU must perform faster compared to memory mapping (I'm not sure about this yet) – Afshin Feb 21 '17 at 08:28
  • 1
    The kernel code is memory limited, I doubt you will be able to optimize it much. OpenCL is not targeted for this kind of workloads. If this operation is a pre/post-phase of some other math you should write that math in the kernel, instead of just the bit mix step. – DarkZeros Feb 21 '17 at 14:59
  • 1
    It's worth noting that GPU architectures are usually optimized for significant amounts of floating point operations, with little attention paid to integer operations. I've had exotic workloads finish significantly faster on the CPU than on the GPU when the GPU is old enough and the workload involved a lot of integer operations. – Xirema Feb 21 '17 at 20:23
  • Yes you are right. I did the test with some floating point operations and got faster results. – Afshin Feb 22 '17 at 05:25
  • @DarkZeros ... I know that the kernel code could be optimized, but when I did that the result was the same. After some other tests I realized that floating point operations are faster on the GPU than on the CPU. – Afshin Feb 22 '17 at 05:31
  • I read somewhere that 4 thinner cores can do integer add and the single thicker core can do integer multiplication division but it doesn't say which one is doing "and" "or" operations (I mean vliw part) – huseyin tugrul buyukisik Feb 22 '17 at 10:35

2 Answers2

1

ATI RV730 has a VLIW structure, so it is better to try uint4 and int4 vector types with 1/4 of the total number of threads (which is NumberOfAllElements/16). This would also make loading from memory faster for each work item.

Also kernel doesn't have much calculations compared to memory operations. Making buffers mapped to RAM would have better performance. Don't copy arrays, map them to memory using map/unmap enqueue commands.

If its still not faster, you can use both gpu and cpu at the same time to work on first half and second half of work to finish it in %50 time.

Also don't put clFinish in loop. Put it just after the end of loop. This way it will enqueue it much faster and it already has in-order execution so it won't start others before finishing the first item. It is in-order queue I suppose and adding clfinish after each enqueue is extra overhead. Only a single clfinish after latest kernel is enough.


ATI RV730: 64 VLIW units, each has at least 4 streaming cores. 750 MHz.

i3-2100: 2 cores(threads just for anti-bubbling) each having AVX that capable of streaming 8 operations simultaneously. So this can have 16 operations in flight. More than 3 GHz.

Simply multiplication of streaming operations with frequencies:

ATI RV730 = 192 units (more with multiply-add functions, by 5th element of each vliw)

i3-2100 = 48 units

so the GPU should be at least 4x as fast (use int4, uint4). This is for simple ALU and FPU operations such as bitwise operations and multiplications. Performance of special functions such as transcendentals could be different since they run only on the 5th unit of each VLIW.

huseyin tugrul buyukisik
  • 11,469
  • 4
  • 45
  • 97
  • I did not consider the data transfer time in my measurement. I guess that after the copy, the GPU must perform faster compared to memory mapping (I'm not sure about this yet) – Afshin Feb 21 '17 at 08:33
  • then it is vliw microarchitecture needing 4-wide vectors instead of scalars you use. If there were 1M threads, now only 256k threads would finish job much quicker using int4 – huseyin tugrul buyukisik Feb 21 '17 at 08:36
  • 1
    delete clfinish from loop, put it just after the loop – huseyin tugrul buyukisik Feb 21 '17 at 08:38
  • last time used opencl profiling without the loop(for one time). Results didn't changed (I added them to the question). – Afshin Feb 21 '17 at 08:41
  • 1
    @Afshin running for one time means it is not optimized yet. It gets better after multiple running. Enqueue 10 times, clfinish once at the end – huseyin tugrul buyukisik Feb 21 '17 at 08:42
  • I guess multiple running is just to get a good estimation of time (if timing resolution is not enough which here is enough). Since running of kernel in the final process will be one time, it is not acceptable for code to be optimized in multiple running. – Afshin Feb 22 '17 at 05:39
  • No i mean just for benchmarking. You should try int4 and uint4 vectors – huseyin tugrul buyukisik Feb 22 '17 at 08:21
0

I did some extra tests and realized that the GPU is optimized for floating point operations. I changed the test code as below:

/// OpenCL kernel: floating-point benchmark variant of the packed 16-bit add.
/// Each work item splits its 32-bit words into 16-bit halves (as floats),
/// mixes in sin/cos work to load the FPU, masks back to 16-bit lanes, and
/// recombines into C.
void kernel simple_add(global const int * A, global const uint * B, global int* C)
    {
        ///------------------------------------------------
        /// Add 16 bits of each
        int AA=A[get_global_id(0)];
        int BB=B[get_global_id(0)];
        float AH=0xFFFF0000 & AA;   // high lane as float (large values lose precision — benchmark only)
        float AL=0x0000FFFF & AA;
        float BH=0xFFFF0000 & BB;
        float BL=0x0000FFFF & BB;
        int CL=(int)(AL*cos(AL)+BL*sin(BL))&0x0000FFFF;
        // BUG FIX: the high-half term previously read sin(BL); it should be
        // sin(BH) so the high halves are combined with the same structure as
        // the low halves (AL*cos(AL)+BL*sin(BL) vs AH*cos(AH)+BH*sin(BH)).
        int CH=(int)(AH*cos(AH)+BH*sin(BH))&0xFFFF0000;
           C[get_global_id(0)]=CH|CL;
     }

and got the result that I expected (about 10 times faster):

                CPU time:      741046.665  micro-sec
                GPU time:       54618.889  micro-sec
                ----------------------------------------------------
                CPU time:      741788.112  micro-sec
                GPU time:       54875.666  micro-sec
                ----------------------------------------------------
                CPU time:      739975.979  micro-sec
                GPU time:       54560.445  micro-sec
                ----------------------------------------------------
                CPU time:      755848.937  micro-sec
                GPU time:       54582.111  micro-sec
                ----------------------------------------------------
                CPU time:      724100.716  micro-sec
                GPU time:       56893.445  micro-sec
                ----------------------------------------------------
                CPU time:      744476.351  micro-sec
                GPU time:       54596.778  micro-sec
                ----------------------------------------------------
                CPU time:      727787.538  micro-sec
                GPU time:       54602.445  micro-sec
                ----------------------------------------------------
                CPU time:      731132.939  micro-sec
                GPU time:       54710.000  micro-sec
                ----------------------------------------------------
                CPU time:      727899.150  micro-sec
                GPU time:       54583.444  micro-sec
                ----------------------------------------------------
                CPU time:      727089.880  micro-sec
                GPU time:       54594.778  micro-sec
                ----------------------------------------------------

for a bit heavier floating point operations like below:

        /// OpenCL kernel: heavier floating-point benchmark variant.
        /// Same lane-splitting scheme as before, but each half is weighted by a
        /// six-term sum of sin/cos harmonics to increase the transcendental-FPU
        /// load per memory access. Results are masked back into 16-bit lanes
        /// and recombined into one 32-bit word per work item.
        void kernel simple_add(global const int * A, global const uint * B, global int* C)
            {
                ///------------------------------------------------
                /// Add 16 bits of each
                int AA=A[get_global_id(0)];
                int BB=B[get_global_id(0)];
                float AH=0xFFFF0000 & AA;
                float AL=0x0000FFFF & AA;
                float BH=0xFFFF0000 & BB;
                float BL=0x0000FFFF & BB;
                // Low lane: weighted harmonic sum of both inputs, truncated and masked.
                int CL=(int)(AL*(cos(AL)+sin(2*AL)+cos(3*AL)+sin(4*AL)+cos(5*AL)+sin(6*AL))+
                        BL*(cos(BL)+sin(2*BL)+cos(3*BL)+sin(4*BL)+cos(5*BL)+sin(6*BL)))&0x0000FFFF;
                // High lane: same expression built from the high halves.
                int CH=(int)(AH*(cos(AH)+sin(2*AH)+cos(3*AH)+sin(4*AH)+cos(5*AH)+sin(6*AH))+
                        BH*(cos(BH)+sin(2*BH)+cos(3*BH)+sin(4*BH)+cos(5*BH)+sin(6*BH)))&0xFFFF0000;
                        C[get_global_id(0)]=CH|CL;

             }

The result was more or less the same:

                CPU time:     3905725.933  micro-sec
                GPU time:      354543.111  micro-sec
                -----------------------------------------
                CPU time:     3698211.308  micro-sec
                GPU time:      354850.333  micro-sec
                -----------------------------------------
                CPU time:     3696179.243  micro-sec
                GPU time:      354302.667  micro-sec
                -----------------------------------------
                CPU time:     3692988.914  micro-sec
                GPU time:      354764.111  micro-sec
                -----------------------------------------
                CPU time:     3699645.146  micro-sec
                GPU time:      354287.666  micro-sec
                -----------------------------------------
                CPU time:     3681591.964  micro-sec
                GPU time:      357071.889  micro-sec
                -----------------------------------------
                CPU time:     3744179.707  micro-sec
                GPU time:      354249.444  micro-sec
                -----------------------------------------
                CPU time:     3704143.214  micro-sec
                GPU time:      354934.111  micro-sec
                -----------------------------------------
                CPU time:     3667518.628  micro-sec
                GPU time:      354809.222  micro-sec
                -----------------------------------------
                CPU time:     3714312.759  micro-sec
                GPU time:      354883.888  micro-sec
                -----------------------------------------
Afshin
  • 392
  • 2
  • 11