
I'm using UVA and OpenMP together in my algorithm to make it more powerful.

The problem is that when I launch kernels in parallel, for example 3 CPU threads each launching one kernel at the same time, one thread gets nan values.

It seems that GPU X cannot read a variable allocated on GPU 0.

That is strange, considering that I grant every GPU (in this case 1 and 2) access to GPU 0.

Is there a problem with using UVA and OpenMP together, or is it a problem in the code?

Here are the code and the results. I've created an MCVE to demonstrate the error:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <math.h>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include "math_constants.h"
#include <omp.h>
#include <cufft.h>

inline bool IsGPUCapableP2P(cudaDeviceProp *pProp)
{
  #ifdef _WIN32
     return (bool)(pProp->tccDriver ? true : false);
  #else
     return (bool)(pProp->major >= 2);
  #endif
}

inline bool IsAppBuiltAs64()
{
  #if defined(__x86_64) || defined(AMD64) || defined(_M_AMD64)
     return true;
  #else
     return false;
  #endif
}

__global__ void kernelFunction(cufftComplex *I, int i, int N)
{
    int j = threadIdx.x + blockDim.x * blockIdx.x;
    int k = threadIdx.y + blockDim.y * blockIdx.y;

    if(j==0 && k==0){
        printf("I'm thread %d and I'm reading device_I[0] = %f\n", i, I[N*j+k].x);
    }
}

__host__ int main(int argc, char **argv) {
int num_gpus;
cudaGetDeviceCount(&num_gpus);

if(num_gpus < 1){
    printf("No CUDA capable devices were detected\n");
    return 1;
}

if (!IsAppBuiltAs64()){
    printf("%s is only supported on 64-bit OSs and the application must be built as a 64-bit target. Test is being waived.\n", argv[0]);
    exit(EXIT_SUCCESS);
}



printf("Number of host CPUs:\t%d\n", omp_get_num_procs());
printf("Number of CUDA devices:\t%d\n", num_gpus);


for(int i = 0; i < num_gpus; i++){
    cudaDeviceProp dprop;
    cudaGetDeviceProperties(&dprop, i);
    printf("> GPU%d = \"%15s\" %s capable of Peer-to-Peer (P2P)\n", i, dprop.name, (IsGPUCapableP2P(&dprop) ? "IS " : "NOT"));
}
printf("---------------------------\n");


num_gpus = 3; //The case that fails
omp_set_num_threads(num_gpus);

if(num_gpus > 1){
    for(int i=1; i<num_gpus; i++){
        cudaDeviceProp dprop0, dpropX;
        cudaGetDeviceProperties(&dprop0, 0);
        cudaGetDeviceProperties(&dpropX, i);
        int canAccessPeer0_x, canAccessPeerx_0;
        cudaDeviceCanAccessPeer(&canAccessPeer0_x, 0, i);
        cudaDeviceCanAccessPeer(&canAccessPeerx_0, i, 0);
        printf("> Peer-to-Peer (P2P) access from %s (GPU%d) -> %s (GPU%d) : %s\n", dprop0.name, 0, dpropX.name, i, canAccessPeer0_x ? "Yes" : "No");
        printf("> Peer-to-Peer (P2P) access from %s (GPU%d) -> %s (GPU%d) : %s\n", dpropX.name, i, dprop0.name, 0, canAccessPeerx_0 ? "Yes" : "No");
        if(canAccessPeer0_x == 0 || canAccessPeerx_0 == 0){
            printf("Two or more SM 2.0 class GPUs are required for %s to run.\n", argv[0]);
            printf("Support for UVA requires a GPU with SM 2.0 capabilities.\n");
            printf("Peer to Peer access is not available between GPU%d <-> GPU%d, waiving test.\n", 0, i);
            exit(EXIT_SUCCESS);
        }else{
            cudaSetDevice(0);
            printf("Granting access from 0 to %d...\n", i);
            cudaDeviceEnablePeerAccess(i, 0);
            cudaSetDevice(i);
            printf("Granting access from %d to 0...\n", i);
            cudaDeviceEnablePeerAccess(0, 0);

            printf("Checking GPU%d and GPU%d for UVA capabilities...\n", 0, i);
            const bool has_uva = (dprop0.unifiedAddressing && dpropX.unifiedAddressing);
            printf("> %s (GPU%d) supports UVA: %s\n", dprop0.name, 0, (dprop0.unifiedAddressing ? "Yes" : "No"));
            printf("> %s (GPU%d) supports UVA: %s\n", dpropX.name, i, (dpropX.unifiedAddressing ? "Yes" : "No"));
            if(has_uva){
                printf("Both GPUs can support UVA, enabling...\n");
            }else{
                printf("At least one of the two GPUs does NOT support UVA, waiving test.\n");
                exit(EXIT_SUCCESS);
            }
        }
    }
}

int M = 512;
int N = 512;

cufftComplex *host_I = (cufftComplex*)malloc(M*N*sizeof(cufftComplex));
for(int i=0;i<M;i++){
    for(int j=0;j<N;j++){
        host_I[N*i+j].x = 0.001;
        host_I[N*i+j].y = 0;
    }
}

cufftComplex *device_I;
cudaSetDevice(0);
cudaMalloc((void**)&device_I, sizeof(cufftComplex)*M*N);
cudaMemset(device_I, 0, sizeof(cufftComplex)*M*N);
cudaMemcpy(device_I, host_I, sizeof(cufftComplex)*M*N, cudaMemcpyHostToDevice);

dim3 threads(32,32);
dim3 blocks(M/threads.x, N/threads.y);
dim3 threadsPerBlockNN = threads;
dim3 numBlocksNN = blocks;
#pragma omp parallel
{
    unsigned int i = omp_get_thread_num();
    unsigned int num_cpu_threads = omp_get_num_threads();

    // set and check the CUDA device for this CPU thread
    int gpu_id = -1;
    cudaSetDevice(i % num_gpus);   // "% num_gpus" allows more CPU threads than GPU devices
    cudaGetDevice(&gpu_id);
    //printf("CPU thread %d (of %d) uses CUDA device %d\n", cpu_thread_id, num_cpu_threads, gpu_id);
    kernelFunction<<<numBlocksNN, threadsPerBlockNN>>>(device_I, i, N);
    cudaDeviceSynchronize();
}

cudaFree(device_I);

for(int i=1; i<num_gpus; i++){
    cudaSetDevice(0);
    cudaDeviceDisablePeerAccess(i);
    cudaSetDevice(i);
    cudaDeviceDisablePeerAccess(0);
}

for(int i=0; i<num_gpus; i++ ){
    cudaSetDevice(i);
    cudaDeviceReset();
}

free(host_I);

return 0;
}

The results are:

Both GPUs can support UVA, enabling...

I'm thread 0 and I'm reading device_I[0] = 0.001000

I'm thread 2 and I'm reading device_I[0] = 0.001000

I'm thread 1 and I'm reading device_I[0] = -nan

The command line to compile is:

nvcc -Xcompiler -fopenmp -lgomp -arch=sm_37 main.cu -lcufft

Here is the result of simpleP2P:

[miguel.carcamo@belka simpleP2P]$ ./simpleP2P 
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8
> GPU0 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU2 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU3 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU4 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU5 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU6 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)
> GPU7 = "      Tesla K80" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer-to-Peer (P2P) access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : Yes
> Peer-to-Peer (P2P) access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla K80 (GPU0) supports UVA: Yes
> Tesla K80 (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 0.79GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Enabling peer access...
Shutting down...
Test failed!
  • Are you sure the `nan` doesn't come from the expression `I[N*j+k].x`? – Weather Vane Dec 28 '15 at 18:25
  • You may have a problem with your specific machine. [This](http://pastebin.com/5Qe90R1U) is the output when I run your program on a K80 cluster that is properly set up. [This](http://pastebin.com/1uRCkXD4) is the output in another cluster that has older Fermi GPUs (M2050/M2070) that are also properly set up for P2P. So I'm unable to reproduce your issue on two different configurations. You may want to start by adding **rigorous** CUDA [error checking](http://stackoverflow.com/questions/14038589) -- check the return code from **every** cuda runtime API call in your program. – Robert Crovella Dec 28 '15 at 18:46
  • @RobertCrovella - Your comment would make a good answer, I think. – ryyker Dec 28 '15 at 18:59
  • @ryyker I believe "it works for me" is not an acceptable answer on SO. That is based on advice that was given to me when I first started on SO a few years ago, from someone more experienced than I. So I generally follow that rule. You may disagree. In this case I believe there is a possibility of more fruitful dialog with OP, although whether such a dialog and underlying question is a good fit for SO is debatable. – Robert Crovella Dec 28 '15 at 19:04
  • Your comment (with links), in this case, provides more than just a simple _works for me_. It provides very relevant and helpful information specific to the OP's question. I would upvote. But I also understand sticking to your own rules. – ryyker Dec 28 '15 at 19:16
  • In the event that the CUDA error checking turns up no errors, then the troubleshooting process would generally follow the path outlined [here](https://devtalk.nvidia.com/default/topic/883054/cuda-programming-and-performance/multi-gpu-peer-to-peer-access-failing-on-tesla-k80-/1). 1. run the simpleP2P cuda sample code. If it indicates verification errors, then there is possibly a problem with your motherboard supporting P2P. 2. If the simple P2P test turns up verification errors, then check to see if the latest system BIOS is installed on the motherboard. – Robert Crovella Dec 28 '15 at 21:14
  • 3. If the latest system BIOS does not fix the issue, then there is possibly additional investigation that can be done around PCIE ACS settings, but at that point the problem should probably be referred to the system vendor. And if the system was not properly qualified by the system vendor for K80 usage (i.e. somebody assembled the system from random pieces) then there is no assurance that this could ever work. – Robert Crovella Dec 28 '15 at 21:14
  • I've added the result of simpleP2P. So the next steps should be check the problem with the motherboard? That is check if the PCIE ACS setting is enabled and check if the BIOS has the latest system? – Miguel Cárcamo Dec 28 '15 at 22:26
  • Yes, the verification errors in the simpleP2P test indicate that the problem is not in the code you have shown, but in the system you are working on. I would suggest the next step is to make sure your system/motherboard has the latest BIOS installed on it. If it does not, try installing the latest BIOS. With the latest BIOS, repeat the simpleP2P test. If it is still failing, go into the BIOS setup and find out if there are any settings that allow you to enable/disable ACS. If so, you want to try disabling ACS. – Robert Crovella Dec 28 '15 at 22:45
  • What kind of system (manufacturer, model number) are these K80's installed in? – Robert Crovella Dec 28 '15 at 22:50
  • I don't know, I will ask and I'll let you know – Miguel Cárcamo Dec 29 '15 at 00:50
  • product: Z10PE-D8 WS (All) vendor: ASUSTeK COMPUTER INC. width: 64 bits capabilities: smbios-2.8 dmi-2.7 vsyscall64 vsyscall32 physical id: 0 version: Rev 1.xx – Miguel Cárcamo Dec 29 '15 at 18:16
  • - From the version before the last one, the BIOS corrects the ACS MoBo ASUS Z10PE-D8 WS - It's necessary to activate the memory mapping over 4Gb (if not there is no booting) BIOS > Advanced > PCIe/PCI/PnP Configuration > Above 4G Decoding and set it to Enabled. – Miguel Cárcamo Dec 29 '15 at 18:19
  • So is the simpleP2P test working now? I don't believe the system you are using has been validated for K80 GPUs, so I hope you are providing adequate power and cooling to the GPUs. You may want to monitor the GPU temperatures. – Robert Crovella Dec 29 '15 at 20:00
  • Yes, simpleP2P is working and my program is also working. I think the people in charge have provided adequate power and cooling. – Miguel Cárcamo Dec 29 '15 at 20:48
  • Yesterday something weird occurred out of nowhere... when I ran nvidia-smi after the update, the command showed all 8 GPUs (4 Tesla K80). Now it only shows 4 GPUs. How could that happen and how can I fix it? – Miguel Cárcamo Dec 30 '15 at 19:06
  • It could happen if the GPUs are overheating. There may be any number of other possibilities as well. The best advice is to use an OEM-qualified configuration. Building your own configuration (which is what the "people in charge" seem to have done there), is more likely to lead to problems like this. – Robert Crovella Dec 31 '15 at 02:44

1 Answer


It seems, based on the debugging in the comments, that the problem was ultimately related to the system that was being used, not OP's code.

K80 is a dual-GPU device, so it has a PCIE bridge chip on-board. Proper use of this configuration, especially when using Peer-to-Peer (P2P) traffic requires proper settings in the upstream PCIE switches and/or root complex. These settings are normally made by the system BIOS, and are not normally/typically software-configurable.

One possible indicator when these settings are incorrect is that the simpleP2P CUDA sample code will report errors during results validation. Therefore, a good test on any system where you are having trouble with P2P code is to run this particular CUDA sample code (simpleP2P). If validation errors are reported (see OP's posting for an example), then these should be addressed first, before any attempt is made to debug the user's P2P code.
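For reference, a typical way to build and run the simpleP2P sample is shown below. The samples directory varies by CUDA version and install location, so the path here is an assumption; adjust it to your own installation:

```shell
# Path is an assumption -- adjust to where the CUDA samples are installed.
cd ~/NVIDIA_CUDA_Samples/0_Simple/simpleP2P
make
./simpleP2P   # look for "Verification error" lines in the output
```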

The best recommendation is to use a system that has been validated by the system vendor for K80 usage. This is generally good practice for any usage of Tesla GPUs, as these GPUs tend to make significant demands on the host system from the standpoints of:

  • power delivery
  • cooling requirements
  • system compatibility (two examples are the types of PCIE settings being discussed here, as well as resource mapping and bootability issues also referred to by OP in the comments)

OEM validated systems will generally have the fewest issues associated with the above requirements/demands that Tesla GPUs place on the host system.

For this particular issue, troubleshooting starts with the simpleP2P test. When validation errors are observed in that test (but no other CUDA runtime errors are reported) then the PCIE settings may be suspect. The easiest way to attempt to address these are by checking for a newer/updated system BIOS which may have the settings correct for this type of usage, or else will offer a BIOS setup option that allows the user to make the necessary changes. The settings involved here are PCIE ACS settings, and if a BIOS setup option is available, those terms will likely be involved. Since BIOS setup varies from system to system, it's not possible to be specific here.

If the BIOS update and/or settings modification does not resolve the issue, then it's probably not fixable for that particular system type. It's possible to troubleshoot the process a bit further using the final steps described here but such troubleshooting, even if successful, cannot lead to a permanent (i.e. will survive a reboot) fix without BIOS modifications.

If the simpleP2P test runs correctly, then debug focus should return to the user's code. General recommendations of using proper CUDA error checking and running the code with cuda-memcheck apply. Furthermore, the simpleP2P sample source code can then be referred to as an example of correct usage of P2P functionality.
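As a sketch of the kind of rigorous error checking recommended here (the macro name `checkCuda` and its style are this sketch's own choice, not a specific library API), every runtime call and every kernel launch can be wrapped like this:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Illustrative error-checking macro: aborts with file/line information
// on any CUDA runtime error.
#define checkCuda(call)                                                  \
  do {                                                                   \
    cudaError_t err = (call);                                            \
    if (err != cudaSuccess) {                                            \
      fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                    \
              cudaGetErrorString(err), __FILE__, __LINE__);              \
      exit(EXIT_FAILURE);                                                \
    }                                                                    \
  } while (0)

// Usage: wrap every runtime API call, and check kernel launches explicitly:
//   checkCuda(cudaSetDevice(0));
//   myKernel<<<blocks, threads>>>(...);
//   checkCuda(cudaGetLastError());       // catches launch-time errors
//   checkCuda(cudaDeviceSynchronize());  // catches asynchronous execution errors
```

Checking `cudaGetLastError()` after the launch matters because kernel launches are asynchronous: the launch itself can fail without any prior API call returning an error.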

Note that, in general, P2P support may vary by GPU or GPU family. The ability to run P2P on one GPU type or family does not necessarily indicate that it will work on another GPU type or family, even in the same system/setup. The final determinant of GPU P2P support is the runtime query via cudaDeviceCanAccessPeer. P2P support can vary by system and other factors as well. No statements made here are a guarantee of P2P support for any particular GPU in any particular setup.
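The runtime query described above can be sketched as a small standalone program (a minimal sketch; it assumes at least two CUDA devices are present and omits error checking for brevity):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Enumerate all ordered device pairs and report whether P2P access is
// possible, as determined by the runtime itself.
int main(void)
{
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int a = 0; a < n; a++) {
        for (int b = 0; b < n; b++) {
            if (a == b) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, a, b);  // can GPU a access GPU b?
            printf("GPU%d -> GPU%d : %s\n", a, b, can ? "Yes" : "No");
        }
    }
    return 0;
}
```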
