At work, on Debian 10, I have 2 RTX A6000 GPU cards connected by NVLink and 1 TB of RAM, and I would like to benefit from the combined power of both cards and the 1 TB of RAM.
Currently, I have the following magma.make, invoked by a Makefile:
CXX = nvcc -std=c++17 -O3
LAPACK = /opt/intel/oneapi/mkl/latest
LAPACK_ANOTHER=/opt/intel/mkl/lib/intel64
MAGMA = /usr/local/magma
INCLUDE_CUDA=/usr/local/cuda/include
LIBCUDA=/usr/local/cuda/lib64
SEARCH_DIRS_INCL=-I${MAGMA}/include -I${INCLUDE_CUDA} -I${LAPACK}/include
SEARCH_DIRS_LINK=-L${LAPACK}/lib/intel64 -L${LAPACK_ANOTHER} -L${LIBCUDA} -L${MAGMA}/lib
CXXFLAGS = -c -DMAGMA_ILP64 -DMKL_ILP64 -m64 ${SEARCH_DIRS_INCL}
LDFLAGS = ${SEARCH_DIRS_LINK} -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lcuda -lcudart -lcublas -lmagma -lpthread -lm -ldl
SOURCES = main_magma.cpp XSAF_C_magma.cpp
EXECUTABLE = main_magma.exe
When I execute my code, I get memory errors, because the code tries to invert matrices of size 120k x 120k.
If we look at it more closely, a 120k x 120k matrix in double precision requires 120k x 120k x 8 bytes, so almost 108 GiB.
The functions involved cannot work in single precision.
Unfortunately, I have 2 NVIDIA GPU cards with only 48 GB each.
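To spell out the arithmetic:

120,000 x 120,000 x 8 bytes = 1.152 x 10^11 bytes ≈ 115.2 GB ≈ 107.3 GiB for one copy of the matrix
2 x 48 GB = 96 GB of combined GPU memory

So even the two cards together cannot hold a single copy of the matrix.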
Question:
Is there a way, from a computational or a coding point of view, to combine the memory of the 2 GPU cards (which would give 96 GB) in order to invert these large matrices?
I am using MAGMA to compile and for the inversion routine, like this:
// ROUTINE MAGMA IMPLEMENTED
#include <vector>
#include <cstdlib>
#include "magma_v2.h"
using std::vector;

void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
    // Start magma part
    magma_int_t m = F_matrix.size();
    if (m) {
        magma_init();                                 // initialize MAGMA
        magma_queue_t queue = NULL;
        magma_int_t dev = 0;
        magma_queue_create(dev, &queue);
        // Loop indices: use magma_int_t so that idx = i*m + j does not overflow
        // for m around 120k (this relies on the 64-bit ILP64 build, -DMAGMA_ILP64)
        magma_int_t i, j, idx;
        double *dwork;                                // dwork - device workspace
        magma_int_t ldwork;                           // size of dwork
        magma_int_t *piv, info;                       // piv - array of pivot indices
        magma_int_t mm = m*m;                         // number of elements of a
        double *a;                                    // a   - m x m matrix on the host
        double *d_a;                                  // d_a - m x m matrix a on the device
        magma_int_t err;
        ldwork = m * magma_get_dgetri_nb(m);          // optimal workspace size
        // allocate matrices
        err = magma_dmalloc_cpu(&a, mm);              // host memory for a
        for (i = 0; i < m; i++) {
            for (j = 0; j < m; j++) {
                idx = i*m + j;
                a[idx] = F_matrix[i][j];
            }
        }
        err = magma_dmalloc(&d_a, mm);                // device memory for a
        err = magma_dmalloc(&dwork, ldwork);          // device memory for the workspace
        piv = (magma_int_t *) malloc(m*sizeof(magma_int_t)); // host memory for pivots
        magma_dsetmatrix(m, m, a, m, d_a, m, queue);  // copy a -> d_a
        magma_dgetrf_gpu(m, m, d_a, m, piv, &info);   // LU factorization on the GPU
        magma_dgetri_gpu(m, d_a, m, piv, dwork, ldwork, &info); // inverse from the LU factors
        magma_dgetmatrix(m, m, d_a, m, a, m, queue);  // copy d_a -> a
        for (i = 0; i < m; i++) {
            for (j = 0; j < m; j++) {
                idx = i*m + j;
                F_output[i][j] = a[idx];
            }
        }
        magma_free_cpu(a);           // free host memory
        free(piv);                   // free host memory
        magma_free(d_a);             // free device memory
        magma_free(dwork);           // free device memory
        magma_queue_destroy(queue);  // destroy queue
        magma_finalize();
        // End magma part
    }
}
If this is not possible to do directly with the NVLink hardware component between the two GPU cards, what workaround could we find to allow this matrix inversion?
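For what it is worth, here is the kind of call I imagine could keep the full matrix in the 1 TB of host RAM and let MAGMA stream blocks to the GPUs. This is only a sketch, under the assumption that the magma_dgetrf_m routine in my MAGMA version takes the number of GPUs as its first argument; the wrapper function factor_in_host_memory is just my illustration, not tested code:

#include <cstdlib>
#include "magma_v2.h"

// Sketch: LU-factor an m x m column-major matrix that lives entirely in host
// RAM, asking MAGMA to use ngpu GPUs and move blocks to/from device memory
// internally (so the matrix itself never has to fit on one card).
void factor_in_host_memory(double *h_A, magma_int_t m, magma_int_t ngpu) {
    magma_init();
    magma_int_t info;
    magma_int_t *ipiv = (magma_int_t *) malloc(m * sizeof(magma_int_t));

    // Host-interface, multi-GPU LU factorization (assumed signature)
    magma_dgetrf_m(ngpu, m, m, h_A, m, ipiv, &info);
    // info == 0 means success; info > 0 means a zero pivot was encountered

    free(ipiv);
    magma_finalize();
}

Of course this would only give the LU factors; I do not know whether there is a matching out-of-core routine for the explicit inverse (dgetri), which is part of my question.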
Edit:
I was told by an HPC engineer:
"The easiest way will be to use the Makefiles until we figure out how cmake can support that. If you do that, you can just replace LAPACKE_dgetrf by magma_dgetrf. MAGMA will use internally one GPU with an out-of-memory algorithm that will factor the matrix, even if it is large and does not fit into the memory of the GPU."
Does this mean that I have to find the appropriate Makefile flags to be able to use magma_dgetrf instead of LAPACKE_dgetrf?
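If I understand the first sentence correctly, the change on the code side would be something like the following sketch (not tested; I assume the matrix is already stored as a flat column-major array a of size m x m in host memory, with a pivot array piv of m magma_int_t):

#include "magma_v2.h"

// What the engineer refers to (CPU only):
//     LAPACKE_dgetrf(LAPACK_COL_MAJOR, m, m, a, m, piv);
// What I understand is suggested instead: the host-interface MAGMA routine,
// which takes a host pointer and moves blocks to the GPU internally.
void lu_factor_with_magma(double *a, magma_int_t m, magma_int_t *piv) {
    magma_int_t info;
    magma_dgetrf(m, m, a, m, piv, &info);   // host-interface dgetrf
}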
And regarding the second sentence, it says that
"MAGMA will use internally one GPU with an out-of-memory algorithm that will factor the matrix".
Does this mean that if my matrix is over 48 GB, MAGMA will be able to spill the rest onto the second A6000 GPU or into the RAM and perform the inversion of the full matrix?
Please let me know which flags to use to build MAGMA correctly in my case.
Currently, I do:
$ mkdir build && cd build
$ cmake -DUSE_FORTRAN=ON \
-DGPU_TARGET=Ampere \
-DLAPACK_LIBRARIES="/opt/intel/oneapi/intelpython/latest/lib/liblapack.so" \
-DMAGMA_ENABLE_CUDA=ON ..
$ cmake --build . --config Release