At work, on Debian 10, I have 2 RTX A6000 GPU cards connected by NVLink and 1 TB of RAM, and I would like to benefit from the combined power of both cards and the 1 TB of RAM.
Currently, I have the following magma.make, invoked by a Makefile:
CXX = nvcc -std=c++17 -O3
LAPACK = /opt/intel/oneapi/mkl/latest
LAPACK_ANOTHER=/opt/intel/mkl/lib/intel64
MAGMA = /usr/local/magma
INCLUDE_CUDA=/usr/local/cuda/include
LIBCUDA=/usr/local/cuda/lib64
SEARCH_DIRS_INCL=-I${MAGMA}/include -I${INCLUDE_CUDA} -I${LAPACK}/include
SEARCH_DIRS_LINK=-L${LAPACK}/lib/intel64 -L${LAPACK_ANOTHER} -L${LIBCUDA} -L${MAGMA}/lib
CXXFLAGS = -c -DMAGMA_ILP64 -DMKL_ILP64 -m64 ${SEARCH_DIRS_INCL}
LDFLAGS = ${SEARCH_DIRS_LINK} -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lcuda -lcudart -lcublas -lmagma -lpthread -lm -ldl
SOURCES = main_magma.cpp XSAF_C_magma.cpp
EXECUTABLE = main_magma.exe
When I execute my code, I get memory errors, because the code tries to invert matrices of size 120k x 120k.
If we look at it more closely, a 120k x 120k matrix in double precision requires 120k x 120k x 8 bytes, so almost 108 GiB.
The functions involved cannot work in single precision.
Unfortunately, I have 2 NVIDIA GPU cards with only 48 GB each.
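To spell out the arithmetic:

120,000 x 120,000 x 8 bytes = 1.152 x 10^11 bytes ≈ 115.2 GB ≈ 107.3 GiB for one copy of the matrix
2 x 48 GB = 96 GB of combined GPU memory

So even the two cards together cannot hold a single copy of the matrix.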
Question:
Is there a way, from a computational or a coding point of view, to combine the memory of the 2 GPU cards (which would give 96 GB) in order to invert these large matrices?
I am using MAGMA to compile and for the inversion routine, like this:
// ROUTINE MAGMA IMPLEMENTED
#include <vector>
#include <cstdlib>
#include "magma_v2.h"
using std::vector;

void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
    // Start magma part
    magma_int_t m = F_matrix.size();
    if (m) {
        magma_init();                                 // initialize MAGMA
        magma_queue_t queue = NULL;
        magma_int_t dev = 0;
        magma_queue_create(dev, &queue);
        // Loop indices: use magma_int_t so that idx = i*m + j does not overflow
        // for m around 120k (this relies on the 64-bit ILP64 build, -DMAGMA_ILP64)
        magma_int_t i, j, idx;
        double *dwork;                                // dwork - device workspace
        magma_int_t ldwork;                           // size of dwork
        magma_int_t *piv, info;                       // piv - array of pivot indices
        magma_int_t mm = m*m;                         // number of elements of a
        double *a;                                    // a   - m x m matrix on the host
        double *d_a;                                  // d_a - m x m matrix a on the device
        magma_int_t err;
        ldwork = m * magma_get_dgetri_nb(m);          // optimal workspace size
        // allocate matrices
        err = magma_dmalloc_cpu(&a, mm);              // host memory for a
        for (i = 0; i < m; i++) {
            for (j = 0; j < m; j++) {
                idx = i*m + j;
                a[idx] = F_matrix[i][j];
            }
        }
        err = magma_dmalloc(&d_a, mm);                // device memory for a
        err = magma_dmalloc(&dwork, ldwork);          // device memory for the workspace
        piv = (magma_int_t *) malloc(m*sizeof(magma_int_t)); // host memory for pivots
        magma_dsetmatrix(m, m, a, m, d_a, m, queue);  // copy a -> d_a
        magma_dgetrf_gpu(m, m, d_a, m, piv, &info);   // LU factorization on the GPU
        magma_dgetri_gpu(m, d_a, m, piv, dwork, ldwork, &info); // inverse from the LU factors
        magma_dgetmatrix(m, m, d_a, m, a, m, queue);  // copy d_a -> a
        for (i = 0; i < m; i++) {
            for (j = 0; j < m; j++) {
                idx = i*m + j;
                F_output[i][j] = a[idx];
            }
        }
        magma_free_cpu(a);           // free host memory
        free(piv);                   // free host memory
        magma_free(d_a);             // free device memory
        magma_free(dwork);           // free device memory
        magma_queue_destroy(queue);  // destroy queue
        magma_finalize();
        // End magma part
    }
}
If this is not possible to do directly with the NVLink hardware component between the two GPU cards, what workaround could we find to allow this matrix inversion?
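For what it is worth, here is the kind of call I imagine could keep the full matrix in the 1 TB of host RAM and let MAGMA stream blocks to the GPUs. This is only a sketch, under the assumption that the magma_dgetrf_m routine in my MAGMA version takes the number of GPUs as its first argument; the wrapper function factor_in_host_memory is just my illustration, not tested code:

#include <cstdlib>
#include "magma_v2.h"

// Sketch: LU-factor an m x m column-major matrix that lives entirely in host
// RAM, asking MAGMA to use ngpu GPUs and move blocks to/from device memory
// internally (so the matrix itself never has to fit on one card).
void factor_in_host_memory(double *h_A, magma_int_t m, magma_int_t ngpu) {
    magma_init();
    magma_int_t info;
    magma_int_t *ipiv = (magma_int_t *) malloc(m * sizeof(magma_int_t));

    // Host-interface, multi-GPU LU factorization (assumed signature)
    magma_dgetrf_m(ngpu, m, m, h_A, m, ipiv, &info);
    // info == 0 means success; info > 0 means a zero pivot was encountered

    free(ipiv);
    magma_finalize();
}

Of course this would only give the LU factors; I do not know whether there is a matching out-of-core routine for the explicit inverse (dgetri), which is part of my question.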
Edit:
I was told by an HPC engineer:
"The easiest way will be to use the Makefiles until we figure out how cmake can support that. If you do that, you can just replace LAPACKE_dgetrf by magma_dgetrf. MAGMA will use internally one GPU with an out-of-memory algorithm that will factor the matrix, even if it is large and does not fit into the memory of the GPU."
Does this mean that I have to find the appropriate Makefile flags to be able to use magma_dgetrf instead of LAPACKE_dgetrf?
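If I understand the first sentence correctly, the change on the code side would be something like the following sketch (not tested; I assume the matrix is already stored as a flat column-major array a of size m x m in host memory, with a pivot array piv of m magma_int_t):

#include "magma_v2.h"

// What the engineer refers to (CPU only):
//     LAPACKE_dgetrf(LAPACK_COL_MAJOR, m, m, a, m, piv);
// What I understand is suggested instead: the host-interface MAGMA routine,
// which takes a host pointer and moves blocks to the GPU internally.
void lu_factor_with_magma(double *a, magma_int_t m, magma_int_t *piv) {
    magma_int_t info;
    magma_dgetrf(m, m, a, m, piv, &info);   // host-interface dgetrf
}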
And regarding the second sentence, it says that
"MAGMA will use internally one GPU with an out-of-memory algorithm that will factor the matrix".
Does this mean that if my matrix is over 48 GB, MAGMA will be able to spill the rest onto the second A6000 GPU or into the RAM and perform the inversion of the full matrix?
Please let me know which flags to use to build MAGMA correctly in my case.
Currently, I do:
$ mkdir build && cd build
$ cmake -DUSE_FORTRAN=ON \
-DGPU_TARGET=Ampere \
-DLAPACK_LIBRARIES="/opt/intel/oneapi/intelpython/latest/lib/liblapack.so" \
-DMAGMA_ENABLE_CUDA=ON ..
$ cmake --build . --config Release