
This is my minimal not-working test case. I'm running it on the Stampede cluster (where CUDA and the rest of the environment are set up automatically), so there should be no issues there.

When I run it, the host output struct b never changes (nor does it even appear to be copied back from the device). My output is 0.0, 0.0 - it should obviously be 100.0, 100.0.

#include <cuda.h>
#include <stdio.h>

struct point {
    double x,y;
};

__global__ void MyFunc(point* d_a) {
    d_a->x = 100.0;
    d_a->y = 100.0;
}

int main(void) {
    point * a = (point*)malloc(sizeof(point));
    a->x=10.0;
    a->y=10.0;
    point * d_a;
    cudaMalloc((void**)&d_a,sizeof(point));
    cudaMemcpy(d_a,a,sizeof(point),cudaMemcpyHostToDevice);
    cudaDeviceSynchronize();

    MyFunc<<<1,1>>>(d_a);

    cudaDeviceSynchronize();
    point * b = (point*)malloc(sizeof(point));
    cudaMemcpy(b,d_a,sizeof(point),cudaMemcpyDeviceToHost);
    printf("%lf %lf\n",b->x,b->y);

    cudaFree(d_a);
    free(a);
    free(b);
    return 0;
}

Code compiles fine with CUDA 6.5.12 using:

nvcc -c -O3 -arch=compute_35 -code=sm_35 test.cu
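(An aside on that flag: -c only produces an object file. The executable I actually run comes from also linking, i.e. something along the lines of the command below - the output name is just what I happened to pick.)

nvcc -O3 -arch=compute_35 -code=sm_35 -o test test.cu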

I run it with ibrun ./test in an interactive terminal session (started with idev -n 1 -N 1).

No errors or segfaults are reported during the run; it just gives the wrong output. Is something wrong with the code, or am I just running it incorrectly?

  • Why do you call `cudaDeviceSynchronize()`? – pSoLT Mar 10 '17 at 08:15
  • @pSoLT I'm not sure if it's necessary or not. I just wanted to make sure that the data would be fully loaded on the GPU before calling the `__global__` function, and then fully done with the `__global__` function before trying to copy the data back to the host. – NoseKnowsAll Mar 10 '17 at 08:17
  • `cudaMemcpy` calls are synchronous, so you don't need to worry about that. – pSoLT Mar 10 '17 at 08:19
  • It's probably a good idea to check return codes on CUDA API calls. Let us know if you have any non-zero error code on any of the calls. – pSoLT Mar 10 '17 at 08:28
  • @pSoLT I was doing that initially... until I discovered that every single CUDA error-check call was segfaulting instantly. The check itself was segfaulting - not the code. So I removed all checks from this test program. – NoseKnowsAll Mar 10 '17 at 08:34
  • They shouldn't segfault. Try using [this](https://codeyarns.com/2011/03/02/how-to-do-error-checking-in-cuda/) – pSoLT Mar 10 '17 at 08:37 (a minimal sketch of that error-checking pattern appears after these comments)
  • There is nothing wrong with your code. Are you certain that your CUDA installation actually works? – talonmies Mar 10 '17 at 10:17
  • @talonmies I'm running on a national facility that I presume has installed CUDA correctly. Did you run the code to confirm it works? – NoseKnowsAll Mar 10 '17 at 18:05
  • I did indeed run it and it works exactly as expected – talonmies Mar 10 '17 at 18:07
  • OK, second try -- are you *really* sure that there are no errors produced at runtime? Do you actually check for errors? If so, how? Are you really certain that the GPUs in question are compute capability 3.5? – talonmies Mar 10 '17 at 21:12
  • If you are using [proper cuda error checking](http://stackoverflow.com/questions/14038589) and it is segfaulting, then you should stop right there, and figure that out. It's a critical issue with your environment and there is little point in asking why other codes are not working correctly. My guess is you have made an error in terms of what you think correct error checking is, if you are getting segfaults. But even if you have made no errors, and your claims are valid, it indicates that your environment is seriously broken. – Robert Crovella Mar 11 '17 at 18:37
  • A few other comments: 1. `nvcc -c ...` is a compile command but does not produce an executable. If you are trying to run the output of `nvcc -c ...` directly, you are probably going to get something like a segfault. 2. Most of the nodes on [stampede](https://portal.tacc.utexas.edu/user-guides/stampede#running-idev) do not [have GPUs](https://portal.tacc.utexas.edu/user-guides/stampede#accelerator-cuda-programming). It's not obvious to me that a simple `idev` command like the one you are showing is going to deposit your interactive session on a GPU-equipped node, but I am not a stampede expert – Robert Crovella Mar 11 '17 at 18:39
  • CUDA 6.5.12 seems like an odd version of CUDA, it does not match the [released version of CUDA 6.5](https://developer.nvidia.com/cuda-toolkit-65) – Robert Crovella Mar 11 '17 at 18:44
  • @RobertCrovella I was running the actual executable after I ran `nvcc -o`, linking correctly. Indeed, `idev` doesn't necessarily give you a GPU node, but apparently you can add some options so that it will. I'm writing an answer that describes my workaround to this issue. As it turns out, the interactive environment wasn't working as expected. – NoseKnowsAll Mar 11 '17 at 21:13
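
For reference, here is a minimal sketch of the error-checking wrapper described in the links in the comments above (the macro and function names follow the convention from those posts and are not part of the original code; it goes in the same .cu file):

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; on any failure, print the error and bail out.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

It is used like gpuErrchk(cudaMemcpy(d_a, a, sizeof(point), cudaMemcpyHostToDevice));, and after a kernel launch the usual pattern is gpuErrchk(cudaGetLastError()); followed by gpuErrchk(cudaDeviceSynchronize()); to catch both launch and execution errors. On a working installation none of these wrappers should segfault.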

1 Answer


Indeed, as has been discussed in the comments, the interactive environment on a GPU node (using idev) was not working as expected. While I don't have a true "answer" per se to my question, I have now figured out a work-around. I welcome others who see this question to try to explain why my original setup was not working.

Compiling the code was not an issue because the code is actually correct, as confirmed by @talonmies and later by me (an aside if you read the comments: the CUDA error-checking code now works too). The issue was that the interactive environment would not let me communicate with the accelerator on the compute node. To fix the problem, instead of logging interactively onto a compute node to run the CUDA code, I submitted an sbatch script from the default login node. Something like the following worked for me:

#!/bin/bash
#SBATCH -A XXXXXXX
#SBATCH -J testRun
#SBATCH -o output/testOutput
#SBATCH -n 1
#SBATCH -p projectName
#SBATCH -t 00:01:00

ibrun ./test
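
Submitting it from the login node is then just the usual sbatch call (the script filename here is arbitrary, and note that the output/ directory referenced by -o needs to exist beforehand or SLURM has nowhere to write the log):

sbatch test.sbatch
squeue -u $USER

The second command is only there to watch the job; once it finishes, the program's output shows up in output/testOutput.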
  • please remember to come back and accept this in a couple of days so that it falls off the unanswered question queue for the CUDA tag – talonmies Mar 12 '17 at 04:54