
I'm trying to allocate multidimensional arrays using CUDA UMA on a Power 8 system. However, I'm running into an issue as the size gets bigger. The code I'm using is below. When the size is 24 x 24 x 24 x 5, it works fine. When I increase it to 64 x 64 x 64 x 8, I get "out of memory" even though I have memory available on my device. AFAIK, I should be able to allocate as much memory via UMA as the GPU device's physical memory, so I would not expect any error. My main configuration is Power 8 with a Tesla K40, where I am getting a seg fault during runtime. However, I tried the code piece I provided on an x86 + K40 machine, and it surprisingly worked.

BTW, if you can tell me another way to do this, apart from transforming all my code from 4D arrays to 1D arrays, I'd greatly appreciate it.

Thanks in advance

Driver: Nvidia 361

#include <iostream>
#include <cuda_runtime.h>

void* operator new[] (size_t len) throw(std::bad_alloc) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
}

template<typename T>
T**** create_4d(int a, int b, int c, int d){
    T**** ary = new T***[a];
    for(int i = 0; i < a; ++i)
    {
        ary[i] = new T**[b];
        for(int j = 0; j < b; ++j){
            ary[i][j] = new T*[c];
            for(int k = 0; k < c; ++k){
                ary[i][j][k] = new T[d];
            }
        }
    }
    return ary;
}

int main() {
   double ****data;

   std::cout << "allocating..." << std::endl;
   data = create_4d<double>(32,65,65,5);
   std::cout << "Hooreey !!!" << std::endl;

   //segfault here
   std::cout << "allocating..." << std::endl;
   data = create_4d<double>(64,65,65,5);
   std::cout << "Hooreey !!!" << std::endl;   

   return 0;
}
grypp
    "segmentation fault" sounds like an error in the *host* code. What version of CUDA? Are you using the latest CUDA driver? Do the release notes mention any UMA restrictions on Power? How much system memory in your Power and x84 systems, respectively? Given that it works on x64, filing a bug with NVIDIA seems reasonable. Note: your data structure is not actually a contiguous 4D matrix, it is really a four-level *tree*, which will be inefficient on both CPUs and GPUs (every element access requires dereferencing multiple pointers, for four memory accesses in total). *Not* recommended. – njuffa Nov 02 '16 at 20:36
  • My configuration is an IBM Power S824L: 12 cores/socket, 8 hardware threads/core at 3.52 GHz, and 1 TB memory. Each socket has a Tesla K40m. I caught the error properly; please look at the main question. – grypp Nov 03 '16 at 13:44
  • 1
  • The comment about "segmentation fault" implying an error in the host code does not suggest incorrect error handling in your code: this could be due to an error in the host portion of the CUDA driver or CUDA runtime. I would suggest filing a bug report with NVIDIA: there are unlikely to be many people who can even attempt to reproduce your issue because very few use CUDA on a hardware platform similar to yours. – njuffa Nov 03 '16 at 16:26

1 Answer


There's been a considerable amount of dialog on your cross-posting here, including an answer to your main question. I'll use this answer to summarize what is there, as well as to answer this question specifically:

BTW, if you can tell me another way to do this, apart from transforming all my code from 4D arrays to 1D arrays, I'd greatly appreciate it.

  1. One of your claims is that you are doing proper error checking ("I caught the error properly."). You are not. CUDA runtime API calls (including cudaMallocManaged) do not by themselves generate C++-style exceptions, so the throw specification on your new operator definition is meaningless. CUDA runtime API calls return an error code. If you want to do proper error checking, you must collect and process this error code. If you collect the error code, you can use it to generate an exception if you wish; an example of how you might do that is contained in the answer by Jared Hoberock to the canonical proper CUDA error checking question (a minimal sketch also follows this list). As a result of this oversight, when your allocations eventually fail, you ignore the failure, and when you then attempt to use those (non-)allocated areas for subsequent pointer storage, you generate a seg fault.

  2. The proximal reason for the allocation failure is that you are in fact running out of memory, as discussed in your cross-posting. You can confirm this easily enough with proper error checking. Managed allocations have a granularity, so when you request many relatively small allocations, you are in fact using more memory than you think: each small allocation is rounded up to the allocation granularity. The allocation granularity varies by system type, and the OpenPower system you are operating on has a much larger allocation granularity than the x86 system you compared it to. As a result, you were running out of memory on the Power system but not on the x86 system. As discussed in your cross-posting, this is easy to verify with strategic calls to cudaMemGetInfo.
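For illustration, here is a minimal sketch of collecting and processing the error code from cudaMallocManaged; the checkCuda helper is my own naming for this sketch, not part of your code or of the CUDA API:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: collect a CUDA runtime error code and fail loudly,
// instead of silently ignoring the failure.
inline void checkCuda(cudaError_t err, const char *what){
    if (err != cudaSuccess){
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

void* operator new[] (size_t len) {
    void *ptr;
    checkCuda(cudaMallocManaged(&ptr, len), "cudaMallocManaged");
    return ptr;
}

With something like this in place, the failing allocation reports the out-of-memory condition immediately, rather than returning a garbage pointer that later produces a seg fault.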

From a performance perspective, this is a pretty bad approach to multidimensional allocations for several reasons:

  1. The allocations you are creating are disjoint, connected by pointers. Therefore, accessing an element through a 4-subscripted pointer array requires 3 or 4 dereferences, and each of these dereferences involves a device memory access. Compared to simulated 4-D access into a 1-D (flat) allocation, this is noticeably slower: the arithmetic needed to convert a simulated 4-D access into a single linear index is much faster than traversing through memory via pointer chasing.

  2. Since the allocations you are creating are disjoint, the managed memory subsystem cannot coalesce them into a single transfer. Therefore, under the hood, a number of transfers equal to the product of your first 3 dimensions will take place at kernel launch time (and presumably at termination, i.e. at the next cudaDeviceSynchronize() call). This data must all be transferred, of course, but you will be doing a large number of very small transfers, compared to a single transfer for a "flat" allocation, and the associated overhead can be significant.

  3. As we've seen, the allocation granularity can seriously impact the memory usage efficiency of such an allocation scheme: what should be using only a small percentage of system memory ends up using all of it.

  4. Operations that depend on contiguous data from "row" to "row" of such an allocation will fail, because the allocations are disjoint. For example, such a matrix, or a subsection of it, could not reliably be passed to a CUBLAS linear algebra routine, because those routines expect rows to be stored contiguously in memory.

The ideal solution would be to create a single flat allocation and then use simulated 4-D indexing to compute a single linear index. That approach would address all 4 concerns above, although it may require substantial code refactoring.
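As a sketch of what that simulated indexing could look like (the idx4 helper name is illustrative, not from your code or from any library; note that the same formula appears in the pointer fixup of the worked example below):

#include <cstddef>

// Hypothetical helper: map 4-D coordinates (i,j,k,l), for an array of
// extents a x b x c x d, onto a single linear index into a flat
// allocation. The first extent (a) is not needed for the index math.
__host__ __device__ inline size_t idx4(int i, int j, int k, int l,
                                       int b, int c, int d){
    return ((size_t)(i*b + j)*c + k)*d + l;
}

// usage, given T *data pointing at a flat allocation of a*b*c*d elements:
//   data[idx4(i, j, k, l, b, c, d)]   rather than   data[i][j][k][l]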

We can however come up with an alternate approach, which preserves the 4-subscripted indexing, but otherwise addresses the concerns in items 2, 3, and 4 above by creating a single underlying flat allocation.

What follows is a worked example. We will actually create 2 managed allocations: one underlying flat allocation for data storage, and one underlying flat allocation (regardless of dimensionality) for pointer storage. It would be possible to combine these two into a single allocation with some careful alignment work, but that is not required to achieve any of the proposed benefits.

The basic methodology is covered in various other CUDA questions here on SO, but most of those consider host-side usage only, since they did not have UM in view. UM, however, allows us to extend the methodology to both host- and device-side usage. We will start by creating a single "base" allocation of the necessary size to store the data. Then we will create an allocation for the pointer arrays, and we will work through that pointer storage, fixing up each pointer to point either to the correct location in the pointer storage or to the correct location in the "base" data array.

Here's a worked example, demonstrating host and device usage, and including proper error checking:

$ cat t1271.cu
#include <iostream>
#include <assert.h>

template<typename T>
T**** create_4d_flat(int a, int b, int c, int d){
    // one flat allocation holding all a*b*c*d data elements
    T *base;
    cudaError_t err = cudaMallocManaged(&base, a*b*c*d*sizeof(T));
    assert(err == cudaSuccess);
    // one flat allocation holding all three levels of pointers:
    // a T*** entries, then a*b T** entries, then a*b*c T* entries
    T ****ary;
    err = cudaMallocManaged(&ary, (a+a*b+a*b*c)*sizeof(T*));
    assert(err == cudaSuccess);
    for (int i = 0; i < a; i++){
      // first-level pointers point into the second-level region
      ary[i] =  (T ***)((ary + a) + i*b);
      for (int j = 0; j < b; j++){
        // second-level pointers point into the third-level region
        ary[i][j] = (T **)((ary + a + a*b) + i*b*c + j*c);
        for (int k = 0; k < c; k++)
          // third-level pointers point into the flat data allocation
          ary[i][j][k] = base + ((i*b+j)*c + k)*d;}}
    return ary;
}

template<typename T>
void free_4d_flat(T**** ary){
    // ary[0][0][0] points at the base data allocation
    if (ary[0][0][0]) cudaFree(ary[0][0][0]);
    if (ary) cudaFree(ary);
}


template<typename T>
__global__ void fill(T**** data, int a, int b, int c, int d){
  // single-threaded kernel, just to demonstrate 4-subscripted access in device code
  unsigned long long int val = 0;
  for (int i = 0; i < a; i++)
    for (int j = 0; j < b; j++)
      for (int k = 0; k < c; k++)
        for (int l = 0; l < d; l++)
          data[i][j][k][l] = val++;
}

void report_gpu_mem()
{
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    std::cout << "Free = " << free << " Total = " << total <<std::endl;
}

int main() {
   report_gpu_mem();

   unsigned long long int ****data2;
   std::cout << "allocating..." << std::endl;
   data2 = create_4d_flat<unsigned long long int>(64, 63, 62, 5);

   report_gpu_mem();

   fill<<<1,1>>>(data2, 64, 63, 62, 5);
   cudaError_t err = cudaDeviceSynchronize();
   assert(err == cudaSuccess);

   std::cout << "validating..." << std::endl;
   for (int i = 0; i < 64*63*62*5; i++)
     if (*(data2[0][0][0] + i) != i) {std::cout << "mismatch at "  << i << " was " << *(data2[0][0][0] + i) << std::endl; return -1;}
   free_4d_flat(data2);
   return 0;
}
$ nvcc -arch=sm_35 -o t1271 t1271.cu
$ cuda-memcheck ./t1271
========= CUDA-MEMCHECK
Free = 5904859136 Total = 5975900160
allocating...
Free = 5892276224 Total = 5975900160
validating...
========= ERROR SUMMARY: 0 errors
$

Notes:

  1. This still involves the pointer-chasing inefficiency (item 1 above). I don't know of a method to avoid that without removing the multiple-subscript arrangement.

  2. I've elected to use 2 different indexing schemes in host and device code. In device code, I am using a normal 4-subscripted index, to demonstrate the utility of that. In host code, I am using a "flat" index, to demonstrate that the underlying storage is contiguous and contiguously addressable.

Robert Crovella