
I am working with Cloo, an OpenCL C# library, and I was wondering how to best determine which device to use for my kernels at runtime. What I really want to know is how many cores I have (compute units * cores per compute unit) on GPUs. How do I do this properly? I can currently determine the number of compute units and the clock frequency.
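
For context, here is roughly the kind of query I can already do with Cloo. This is a minimal sketch: the `ComputeDevice` property names (`MaxComputeUnits`, `MaxClockFrequency`, `MaxWorkGroupSize` below) are the ones I believe Cloo exposes, and may vary slightly between versions.

    // Enumerate all OpenCL platforms/devices via Cloo and print the
    // values that can already be queried: compute units and clock frequency.
    using System;
    using Cloo;

    static class DeviceQuery
    {
        static void Main()
        {
            foreach (ComputePlatform platform in ComputePlatform.Platforms)
            {
                foreach (ComputeDevice device in platform.Devices)
                {
                    Console.WriteLine("{0} ({1})", device.Name, device.Type);
                    Console.WriteLine("  Compute units:   {0}", device.MaxComputeUnits);
                    Console.WriteLine("  Clock frequency: {0} MHz", device.MaxClockFrequency);
                    // No property exposes "cores per compute unit" -- that is the gap.
                }
            }
        }
    }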

EDIT: I have considered profiling (running a speed test) on all devices and saving/comparing the results. But, from my understanding, this poses a problem as well, because you can't write a single program that uses all devices optimally/fairly for comparison.

This would also be useful for choosing an optimal number of worker threads (work items) to specify for each kernel call. Any help is greatly appreciated.

guitar80
  • If you want to implement this in C#, here is a good post about it: http://stackoverflow.com/questions/1542213/how-to-find-the-number-of-cpu-cores-via-net-c – Ricardo Pontual Mar 23 '16 at 16:36
  • Sorry, I wasn't specific: I need the number of cores on a GPU compute unit, so Environment won't help me, unfortunately. – guitar80 Mar 23 '16 at 16:37

1 Answer


Judging performance by core count alone is very hard. Some cores are wider, some are clocked faster. Even if they are the same, different register space / local memory combinations make it even more difficult to guess.

Either you maintain a database of each graphics card's performance per driver, per OS, and per algorithm and scale it by the current frequency, or you simply benchmark the devices before selection, or you query the performance timers of all devices while they are doing the actual acceleration job.
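
For example, a minimal "benchmark before selection" sketch in C# with Cloo. The kernel, problem size, and single timed run are illustrative assumptions, not a tuned benchmark, and Cloo constructor signatures may differ slightly between versions.

    // Run the same trivial kernel once on every device and time it.
    // This only ranks devices for THIS kernel at THIS problem size.
    using System;
    using System.Diagnostics;
    using Cloo;

    static class DeviceBenchmark
    {
        const string Source = @"
            __kernel void square(__global float* a)
            {
                int i = get_global_id(0);
                a[i] = a[i] * a[i];
            }";

        static double TimeDevice(ComputeDevice device, int n)
        {
            var context = new ComputeContext(new[] { device },
                new ComputeContextPropertyList(device.Platform), null, IntPtr.Zero);
            var program = new ComputeProgram(context, Source);
            program.Build(null, null, null, IntPtr.Zero);
            var kernel = program.CreateKernel("square");

            var data = new float[n];
            var buffer = new ComputeBuffer<float>(context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, data);
            kernel.SetMemoryArgument(0, buffer);

            var queue = new ComputeCommandQueue(context, device, ComputeCommandQueueFlags.None);
            queue.Execute(kernel, null, new long[] { n }, null, null); // warm-up run
            queue.Finish();

            var sw = Stopwatch.StartNew();
            queue.Execute(kernel, null, new long[] { n }, null, null);
            queue.Finish();
            return sw.Elapsed.TotalMilliseconds;
        }

        static void Main()
        {
            const int n = 1 << 20;
            foreach (ComputePlatform platform in ComputePlatform.Platforms)
                foreach (ComputeDevice device in platform.Devices)
                    Console.WriteLine("{0}: {1:F3} ms", device.Name, TimeDevice(device, n));
        }
    }

A real benchmark would run the actual kernel you plan to deploy, repeat the measurement several times, and average the results.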

A GTX 680 and an HD 7950 have a similar number of cores, but some algorithms favor the HD 7950 with an extra 200% performance, and the opposite is true for other code.

You cannot query the number of cores. You can query the number of compute units and the maximum number of threads per compute unit, but they are not related to performance unless the devices are of the same architecture.

You can query the optimal number of threads per work group, but that can change with the algorithm you use, so you should try as many values as possible. The same goes for vectorized versions of a scalar function: a CPU (or any VLIW GPU) can multiply 4 or 8 numbers at the same time.
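
A sketch of that kind of sweep, assuming a queue, kernel, and device have already been created as in the benchmark sketch above. The candidate sizes, the fixed power-of-two global size, and the `MaxWorkGroupSize` property name are assumptions; vectorized variants (float4/float8 kernels) would be separate kernels fed through the same loop.

    // Try a range of work-group sizes for one kernel on one device and
    // report which one ran fastest. Global size is a power of two so every
    // candidate local size divides it evenly.
    using System;
    using System.Diagnostics;
    using Cloo;

    static class WorkGroupSweep
    {
        // Usage: FindBestLocalSize(queue, kernel, device, 1 << 20);
        static long FindBestLocalSize(ComputeCommandQueue queue, ComputeKernel kernel,
                                      ComputeDevice device, long globalSize)
        {
            // One warm-up run so driver overhead does not skew the first candidate.
            queue.Execute(kernel, null, new long[] { globalSize }, null, null);
            queue.Finish();

            long best = 1;
            double bestMs = double.MaxValue;

            for (long local = 2; local <= (long)device.MaxWorkGroupSize; local *= 2)
            {
                var sw = Stopwatch.StartNew();
                queue.Execute(kernel, null, new long[] { globalSize }, new long[] { local }, null);
                queue.Finish();
                double ms = sw.Elapsed.TotalMilliseconds;

                Console.WriteLine("local = {0,5}: {1:F3} ms", local, ms);
                if (ms < bestMs) { bestMs = ms; best = local; }
            }
            return best;
        }
    }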

Sometimes the driver's automatic compiler optimization is as good as a hand-tuned one.

https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetDeviceInfo.html

huseyin tugrul buyukisik
  • I was aware that it wasn't a perfect test, but I thought it might be useful at least to take a guess. How would I appropriately write a benchmark program that is fair to all potential devices? I feel like this poses the same problem. (There are obviously more parameters than just raw processing power that determine performance.) – guitar80 Mar 23 '16 at 17:06
  • A benchmark could sweep through all available thread-group-size values (2, 4, 8, 32, 64, ..., 1024) and also take the performance of both vectorized (float4, float16) and scalar (float) versions into consideration; there are many other options, but these are the most important ones imo. These optimal values are also queryable with clGetDeviceInfo – huseyin tugrul buyukisik Mar 23 '16 at 17:10
  • So then I just compare which device ran better on most of the thread-group-size values? – guitar80 Mar 23 '16 at 17:13
  • Maybe not most of them, but any one that, combined with the proper vectorized branch (and some other options), surpasses the other devices' benchmark scores. So it can try tens of different versions of a kernel using those combinations and pick the best one (based on energy consumption, performance, and maybe compile times) – huseyin tugrul buyukisik Mar 23 '16 at 17:14
  • Alright. I understand the general concept, but I still feel like it's hard to make a really solid comparison. Thank you, I will work with this! – guitar80 Mar 23 '16 at 17:15