How to pick target hardware for opencl

Question

How do you tell OpenCL to target build for a gpu instead of a cpu? will it automatically pick one over the other?

ProjectPhysX · Answer 1 · 2022-08-17T10:56:46.770

OpenCL will not automatically pick a device for you. You have to explicitly choose a platform (Intel/AMD/Nvidia) and a device (CPU/GPU) on that platform. Platform #0 and device #0 by default will not always give you the GPU. This is quite cumbersome when running code on different computers, as on each you have to manually select the device.

However there is a smart solution for this, a lightweight OpenCL-Wrapper that automatically picks the fastest available GPU (or CPU if no GPU is available) for you. This works by reading out the number of compute units and clock frequency and adding missing information (number of cores per CU) via vendor and device name with a small database.

Find the source code with an example here.

Here is just the code for automatically selecting the fastest device:

vector<cl::Device> cl_devices; // get all devices of all platforms
{
    vector<cl::Platform> cl_platforms; // get all platforms (drivers)
    cl::Platform::get(&cl_platforms);
    for(uint i=0u; i<(uint)cl_platforms.size(); i++) {
        vector<cl::Device> cl_devices_available;
        cl_platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &cl_devices_available); // to query only GPUs, use CL_DEVICE_TYPE_GPU here
        for(uint j=0u; j<(uint)cl_devices_available.size(); j++) {
            cl_devices.push_back(cl_devices_available[j]);
        }
    }
}
cl::Device cl_device; // select fastest available device
{
    float best_value = 0.0f;
    uint best_i = 0u; // index of fastest device
    for(uint i=0u; i<(uint)cl_devices.size(); i++) { // find device with highest (estimated) floating point performance
        const string name = trim(cl_devices[i].getInfo<CL_DEVICE_NAME>()); // device name
        const string vendor = trim(cl_devices[i].getInfo<CL_DEVICE_VENDOR>()); // device vendor
        const uint compute_units = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const uint clock_frequency = (uint)cl_devices[i].getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const bool is_gpu = cl_devices[i].getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint ipc = is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const bool nvidia_192_cores_per_cu = contains_any(to_lower(name), {" 6", " 7", "ro k", "la k"}) || (clock_frequency<1000u&&contains(to_lower(name), "titan")); // identify Kepler GPUs
        const bool nvidia_64_cores_per_cu = contains_any(to_lower(name), {"p100", "v100", "a100", "a30", " 16", " 20", "titan v", "titan rtx", "ro t", "la t", "ro rtx"}) && !contains(to_lower(name), "rtx a"); // identify P100, Volta, Turing, A100, A30
        const bool amd_128_cores_per_dualcu = contains(to_lower(name), "gfx10"); // identify RDNA/RDNA2 GPUs where dual CUs are reported
        const float nvidia = (float)(contains(to_lower(vendor), "nvidia"))*(nvidia_192_cores_per_cu?192.0f:(nvidia_64_cores_per_cu?64.0f:128.0f)); // Nvidia GPUs have 192 cores/CU (Kepler), 128 cores/CU (Maxwell, Pascal, Ampere) or 64 cores/CU (P100, Volta, Turing, A100)
        const float amd = (float)(contains_any(to_lower(vendor), {"amd", "advanced"}))*(is_gpu?(amd_128_cores_per_dualcu?128.0f:64.0f):0.5f); // AMD GPUs have 64 cores/CU (GCN, CDNA) or 128 cores/dualCU (RDNA, RDNA2), AMD CPUs (with SMT) have 1/2 core/CU
        const float intel = (float)(contains(to_lower(vendor), "intel"))*(is_gpu?8.0f:0.5f); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs (with HT) have 1/2 core/CU
        const float arm = (float)(contains(to_lower(vendor), "arm"))*(is_gpu?8.0f:1.0f); // ARM GPUs usually have 8 cores/CU, ARM CPUs have 1 core/CU
        const uint cores = to_uint((float)compute_units*(nvidia+amd+intel+arm)); // for CPUs, compute_units is the number of threads (twice the number of cores with hyperthreading)
        const float tflops = 1E-6f*(float)cores*(float)ipc*(float)clock_frequency; // estimated device FP32 floating point performance in TeraFLOPs/s
        if(tflops>best_value) {
            best_value = tflops;
            best_i = i;
        }
    }
    const string name = trim(cl_devices[best_i].getInfo<CL_DEVICE_NAME>()); // device name
    cl_device = cl_devices[best_i];
    print_info(name); // print device name
}

Alternatively, you can also make it automatically choose the device with most memory rather than FLOPs, or a device with specified ID from the list of all devices from all platforms. There is many more benefits to using this wrapper, for example significantly simpler code for using arrays and automatic tracking of total device memory allocation, all while not impacting performance in any way.

How to pick target hardware for opencl

1 Answers1

Linked