CUDA Unified Memory Working (in specific, cudaMallocManaged();)

Question

I recently have been playing around with CUDA, and was hoping to try out the unified memory model. I tried playing with sample code, and strangely, when launching the kernel, no values seemed to be updating. Modifying unified data from the host works fine, yet kernels launched simply won't modify the unified data.

My card is a GTX 770 with 4GB of memory. I'm running Arch Linux, kernel 3.14-2, using GCC 4.8 to compile my samples. I'm setting the compute arch to sm_30, and activative -m64 flag

Here's one sample that I was playing with. X[0] and X[1] always evaluate to 0, even when the kernel launches.

#include<stdio.h>
#include <cuda.h>

__global__ void kernel(int* x){
    x[threadIdx.x] = 2;
}

int main(){
    int* x;
    cudaMallocManaged(&x, sizeof(int) * 2);
    cudaError_t error = cudaGetLastError();
    printf("%s\n", error);
    x[0] = 0;
    x[1] = 0;

    kernel<<<1, 2>>>(x);
    cudaDeviceSynchronize();

    printf("result = %d\n", x[1]);

    cudaFree(x);
    return 0;
}

Another sample is this:

__global__ void adjacency_map_init_gpu(adjacency_map_t* map){
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    int i = row * map->width + col;

    max(i, 0);
    min(i, map->width * map->height);

    map->connections[i] = 0;
}

__global__ void adjacency_map_connect_gpu(edge_t* edges, int num_edges, adjacency_map_t* map){

    int i = threadIdx.x + (((gridDim.x * blockIdx.y) + blockIdx.x)*blockDim.x);

    max(i, 0);
    min(i, num_edges);

    int n_start = edges[i].n_start;
    int n_end = edges[i].n_end;

    int map_index = n_start * map->width + n_end;
    map->connections[map_index] = 1;
    printf("%d new value: %d\n", map_index, map->connections[map_index]);
}

adjacency_map_t* adjacency_map_init(int num_nodes, edge_t* edges, int num_edges){
    adjacency_map_t *map;// = (adjacency_map_t*)malloc(sizeof(adjacency_map_t));
    cudaMallocManaged(&map, sizeof(adjacency_map_t));
    cudaMallocManaged(&(map->connections), num_nodes * num_nodes * sizeof(int));
    //map->connections = (int*)malloc(sizeof(int) * num_nodes * num_nodes);

    map->width = num_nodes;
    map->height = num_nodes;

    map->stride = 0;

    //GPU stuff
//    adjacency_map_t *d_map;
//    int* d_connections;

//    cudaMalloc((void**) &d_map, sizeof(adjacency_map_t));
//    cudaMalloc((void**) &d_connections, num_nodes * num_nodes * sizeof(int));

//    cudaMemcpy(d_map, map, sizeof(adjacency_map_t), cudaMemcpyHostToDevice);
//    cudaMemcpy(d_connections, map->connections, num_nodes * num_nodes, cudaMemcpyHostToDevice);
//cudaMemcpy(&(d_map->connections), &d_connections, sizeof(int*), cudaMemcpyHostToDevice);

//    edge_t* d_edges;
//    cudaMalloc((void**) &d_edges, num_edges * sizeof(edge_t));
//    cudaMemcpy(d_edges, edges, num_edges * sizeof(edge_t), cudaMemcpyHostToDevice);

adjacency_map_init_gpu<<<1, 3>>>(map);
cudaDeviceSynchronize();
//adjacency_map_connect_gpu<<<1, 3>>>(edges, num_edges, map);

cudaDeviceSynchronize();

//    cudaMemcpy(map, d_map, sizeof(adjacency_map_t), cudaMemcpyDeviceToHost);
//Synchronize everything
//    cudaFree(map);
//    cudaFree(edges);

return map;

}

Basically, I can access all the elements in the original structure on the host for the second snippet of code. Once I try to launch a kernel function, however, the pointer becomes inaccessible (at least, tested from gdb), and the entire object's data is inaccessible. The only portion of the edges and the map pointer I can still see after the first kernel launch are their respective locations.

Any help would be greatly appreciated! Thanks so much!

Your first code works for me on CUDA 6/RHEL 6.2. That is, I get `result = 2 `printed out. You should add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to all cuda API calls and kernel calls. Arch Linux/gcc4.8 doesn't look like [a supported distro combination](http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux/index.html#abstract). Start by testing the codes without `gdb` and instead use `cuda-memcheck`. Also provide your exact compile command line. — Robert Crovella, May 14 '14 at 05:17
Recompiled gcc to 4.8 (Arch Linux's version is 4.9 right now). Using a local version of GCC to compile. Here's the command I use: /opt/cuda/bin/nvcc -ccbin /opt/gcc/usr/local/bin/gcc mem.cu -arch=compute_30 -code=sm_30 -g -m64 (mem.cu is the name of the file for the first snippet.) — Matthew Daiter, May 14 '14 at 05:35
Added the error checking you linked me to. Didn't print anything when I wrapped cudaMallocManaged(), cudaPeekAtLastError(), and cudaDeviceSynchronize(). — Matthew Daiter, May 14 '14 at 05:40
I don't know what the problem is. The only suggestion I can offer is to switch to a supported distro. As I indicated already, your code sample (at least the first one) works fine for me when I run it on a supported distro/config. — Robert Crovella, May 14 '14 at 12:14
Hey, could I get your xorg config, your linux version, and your nvidia driver version? — Matthew Daiter, May 15 '14 at 21:04
I mentioned my linux version in my first comment (RHEL 6.2) The driver version is the driver that ships with CUDA 6 linux `.run` installer (331.62). `xorg.conf` is irrelevant to this issue/question. Most of the systems I use do not have X running. — Robert Crovella, May 15 '14 at 21:16

score 1 · Accepted Answer · answered May 16 '14 at 05:58

Got it!

Turns out it was a problem with the IOMMU kernel option enabled. My motherboard, GIGABYTE 990-FXAUD3 seems to have had an error with IOMMU between the GPU and the CPU.

Detection: Whenever you launch Unified Memory accessing code in the console (without X), there should be an error message resembling this:

AMD-Vi: Event logged [IO_PAGE_FAULT device=01:00.0 domain=0x0017 address=0x00000002d80d5000 flags=0x0010]

scrolling down the page. There might also be some discolouration in the top right of the screen (there was for me, at least).

Here's the solution (assuming you use GRUB):

Open /etc/default/grub, and for the line GRUB_CMDLINE_LINUX_DEFAULT="" add the option iommu=soft inside the quotes.

Hope this helps people out! Big thanks to Robert Crovella for helping me narrow down the problem!

user111375 · Answer 2 · 2015-03-12T17:38:46.503

0

I did a similar procedure to the one that Matthew Daiter mentioned. I removed the IOMMU option, but did it from BIOS. And this work perfectly!!

edited Mar 12 '15 at 17:38

answered Mar 09 '15 at 16:55

user111375

1
1

CUDA Unified Memory Working (in specific, cudaMallocManaged();)

2 Answers2