1

I'm new to OpenCL and have some problems with array addition. I'm using the code provided at the link below:

http://code.google.com/p/opencl-book-samples/source/browse/#svn%2Ftrunk%2Fsrc%2FChapter_2%2FHelloWorld%253Fstate%253Dclosed

and I added some code to measure the performance of the GPU:

// The queue must be created with CL_QUEUE_PROFILING_ENABLE,
// otherwise clGetEventProfilingInfo returns no valid timestamps.
clFinish(commandQueue);

// Queue the kernel up for execution across the array
cl_ulong start = 0, end = 0;
cl_event k_events;

errNum = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL,
                                globalWorkSize, localWorkSize,
                                0, NULL, &k_events);

// Wait for the kernel to finish before reading its timestamps
clWaitForEvents(1, &k_events);

clGetEventProfilingInfo(k_events, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(k_events, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &end, NULL);

double GPUTime = (double)(end - start);  // nanoseconds

And this to measure the CPU time:

LARGE_INTEGER CPUstart, finish, freq;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&CPUstart);

for (int i = 0; i < ARRAY_SIZE; i++) {
    result[i] = a[i] + b[i];
}

QueryPerformanceCounter(&finish);
// Ticks divided by (ticks per nanosecond) gives nanoseconds
double timeCPU = (finish.QuadPart - CPUstart.QuadPart) / ((double)freq.QuadPart / 1000000000.0);

The first problem I encountered is the array size: it can't go beyond 10000. If it does, the program just crashes. How can I fix this?

The second problem is the performance: the GPU/CPU ratio varies too widely, from 13% to about 210%. Why does this happen, and can you suggest a fix?

Edit: I figured out the second one. The lag was caused by the power saving mode, which set the core/memory clocks much lower than the defaults. After using a program to lock the clocks, the performance is rock stable at ~150-300% (GPU/CPU).

Good case:

GPU time :632667 nanosecs.
CPU time : 990023 nanosecs.
GPU/CPU ratio : 156.484 percent.

And a bad one:

GPU time :6.83267e+006 nanosecs.
CPU time : 1.00756e+006 nanosecs.
GPU/CPU ratio : 14.7462 percent.

Any ideas will be appreciated. Thank you :D

PS: The CPU is a Core i3-370M and the GPU is an HD5470. I use VS2008 on Windows 7.

Tiana987642
  • Whoa. You like outdated dev environments combined with a decent OS? The 5470 may be memory-strapped. Did you look at the numbers to check that you do not overload the poor 5470? – TomTom Jul 09 '12 at 13:05
  • So you suggest I update to a newer version of VS? I will try that. – Tiana987642 Jul 09 '12 at 13:12
  • It makes little sense not to. If it makes sense, go all the way to 2012: it will be out of release candidate in 2-3 months, so depending on project length it is worth it. It is also fully supported by Microsoft (the GC). – TomTom Jul 09 '12 at 13:13
  • I know the HD5470 is weak, but I don't think that adding 2 arrays of 10000 elements can stress it. – Tiana987642 Jul 09 '12 at 13:14

4 Answers

1

Here's a good answer that explains why you are hitting this limit:

Is there a max array length limit in C++?

If you can work out a way to manage memory in your code, that might help alleviate some of your problems.

By the way, I'd look at other OSes, such as a Linux environment, that might be better able to run your code. Windows is full of memory-hogging services, which might be a factor in your problem. Or you can just get better hardware.

sksallaj
1

A few things:

If your global work size is not an exact multiple of your local work size, you may end up with a small leftover fraction, e.g. local size 100 and global size 1050 -> 50 extra. IIRC this leftover part still gets processed. A fix for that problem is to either a) make sure the sizes divide evenly, or b) check a guard variable in the kernel and return if the index is outside the range.
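Both fixes can be combined. The following is only a minimal sketch, not the sample's actual code: the helper `round_up_global_size`, the kernel name `guarded_add` and the extra kernel argument `n` (the real element count) are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round the global work size up to the next multiple
// of the local work size so that the two divide evenly.
std::size_t round_up_global_size(std::size_t global, std::size_t local)
{
    assert(local > 0);
    std::size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}

// Hypothetical kernel source with a guard: work-items whose index falls in
// the padding (gid >= n) return without touching memory.
static const char* kGuardedAddSource =
    "__kernel void guarded_add(__global const float* a,\n"
    "                          __global const float* b,\n"
    "                          __global float* result,\n"
    "                          const unsigned int n)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    if (gid >= n) return;   /* outside the real array */\n"
    "    result[gid] = a[gid] + b[gid];\n"
    "}\n";
```

With the numbers above, `round_up_global_size(1050, 100)` gives 1100, so the last 50 work-items hit the guard and do nothing.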

Secondly, I noticed some strangeness with clGetEventProfilingInfo: sometimes it was quite accurate and sometimes quite inaccurate. I ended up using clFinish and QueryPerformanceCounter to benchmark my CL code.
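A rough sketch of that wall-clock approach is below. It uses `std::chrono::steady_clock` as a portable stand-in for QueryPerformanceCounter (my substitution; it needs C++11, so it will not build on VS2008), and `time_ns` is a hypothetical helper name. The callable is assumed to enqueue the kernel and then call clFinish on the queue, so that the measurement covers the whole execution and not just the enqueue call.

```cpp
#include <chrono>

// Hypothetical helper: run `work` once and return the elapsed wall-clock
// time in nanoseconds. For OpenCL, `work` is assumed to enqueue the kernel
// and then block in clFinish until it has completed.
template <typename Work>
long long time_ns(Work work)
{
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```

Timing the CPU loop and the GPU path through the same helper keeps the two numbers comparable, since both then include the same kind of overhead.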

nullspace
  • I will take note of your suggestion, thanks :) About the performance: I think Windows power management causes the lag issue. I ran the program ~30 times with the High Performance option and feel pretty sure about it :) – Tiana987642 Jul 10 '12 at 11:27
1

One possible (and most probable) reason that your program crashes with bigger array sizes is the following code in main.cpp (lines 274-276 in the original code):

float result[ARRAY_SIZE];
float a[ARRAY_SIZE];
float b[ARRAY_SIZE];

These are automatic arrays, and space for them is allocated on the stack of the main function. The total space required is 3*ARRAY_SIZE*sizeof(float), which equals 12*ARRAY_SIZE bytes. The default stack size on Windows is 1 MiB (1,048,576 bytes), which means ARRAY_SIZE could be at most 87381. This is the upper limit given the default stack size; since the stack is also used for other things, the real value is somewhat lower.

You can increase the stack size on the Linker -> System page of your VS project properties. Or, better, allocate those arrays on the heap using malloc() or new[].
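As an illustration of the heap route, here is a minimal sketch that uses `std::vector` in place of malloc() or new[] (my substitution; it also lives on the heap but frees itself). The names `max_elements_on_stack` and `HostArrays` are hypothetical and not part of the sample.

```cpp
#include <cstddef>
#include <vector>

// Largest ARRAY_SIZE for which three float arrays fit in `stack_bytes`,
// ignoring everything else that lives on the stack.
std::size_t max_elements_on_stack(std::size_t stack_bytes)
{
    return stack_bytes / (3 * sizeof(float));
}

// Hypothetical replacement for the three automatic arrays: std::vector keeps
// its elements on the heap, so the size is no longer bounded by the stack.
struct HostArrays
{
    std::vector<float> a, b, result;

    explicit HostArrays(std::size_t n) : a(n), b(n), result(n) {}
};
```

After `HostArrays h(1000000);`, a pointer such as `&h.a[0]` can be used wherever the sample used the array `a`, and `max_elements_on_stack(1024 * 1024)` reproduces the stack bound discussed above.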

Hristo Iliev
  • Thanks, your idea actually works: with malloc() it works up to 10^7 elements, which is great. A larger number of elements causes an error, but I don't need more :D – Tiana987642 Jul 10 '12 at 12:00
  • Well, you are still limited by the amount of memory that your GPU has, since the arrays are first copied to the GPU, the computation is done there, and then the result array is copied back to main memory. Three `float` arrays of 10 million elements already take more than 100 MB. – Hristo Iliev Jul 10 '12 at 12:17
0

You can use the clGetDeviceInfo API call to determine two key parameters for your OpenCL device:

CL_DEVICE_MAX_MEM_ALLOC_SIZE and CL_DEVICE_GLOBAL_MEM_SIZE

http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html

These tell you how large a single allocation can be (CL_DEVICE_MAX_MEM_ALLOC_SIZE) and how much global memory the device has in total (CL_DEVICE_GLOBAL_MEM_SIZE).

Tim Child
  • Thanks, here's the info: Device name = Cedar; Driver version = CAL 1.4.1664 (VM); Global Memory (MB): 512; Local Memory (KB): 32; MAX_MEM_ALLOC_SIZE (KB): 195072. It's still within the limit, right? – Tiana987642 Jul 10 '12 at 11:36
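
The figures in that comment can be checked with a little arithmetic. This is only a sketch: the constants are copied from the comment, `three_float_buffers_fit` is a hypothetical name, and it assumes one buffer per array (a, b and result). Passing the check is necessary but not sufficient, since other allocations may already occupy part of the device memory.

```cpp
// Figures reported in the comment above (Cedar / HD5470).
const unsigned long long kMaxAllocBytes  = 195072ULL * 1024;      // MAX_MEM_ALLOC_SIZE
const unsigned long long kGlobalMemBytes = 512ULL * 1024 * 1024;  // global memory

// Hypothetical check: do three float buffers of n elements each respect
// both the per-allocation limit and the total global memory?
bool three_float_buffers_fit(unsigned long long n)
{
    unsigned long long per_buffer = n * sizeof(float);
    return per_buffer <= kMaxAllocBytes && 3 * per_buffer <= kGlobalMemBytes;
}
```

For 10^7 elements each buffer is about 40 MB and the three together about 120 MB, so both limits are respected. At 10^8 elements a single buffer would already exceed MAX_MEM_ALLOC_SIZE.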