How to perform PyCUDA 4x4 matrix inversion with same accuracy than numpy linalg "inv" or "pinv" function

Question

I am facing an issue of accuracy about my code which performs a number (128, 256, 512) of 4x4 matrix inversions. When I use the original version, i.e the numpy function np.linalg.inv or np.linalg.pinv, everything works fine.

Unfortunately, with the CUDA code below, I get nan and inf values into inverted matrix.

To be more explicit, I take this matrix to invert :

2.120771107884677649e+09 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 3.557266600921528288e+27 3.557266600921528041e+07 3.557266600921528320e+17
0.000000000000000000e+00 3.557266600921528041e+07 3.557266600921528288e+27 3.557266600921528041e+07
0.000000000000000000e+00 3.557266600921528320e+17 3.557266600921528041e+07 1.778633300460764144e+27

If I use classical numpy "inv", I get for the following inverted 3x3 matrix :

4.715266047722758306e-10 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 2.811147187396482366e-28 -2.811147186834252285e-48 -5.622294374792964645e-38
0.000000000000000000e+00 -2.811147186834252285e-48 2.811147187396482366e-28 -5.622294374230735768e-48
0.000000000000000000e+00 -5.622294374792964645e-38 -5.622294374230735768e-48 5.622294374792964732e-28

To check the validity of this inverse matrix, I have multiplied it by original matrix and the result is the identity matrix.

But with CUDA GPU inversion, I get after the inversion this matrix :

0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00
-inf -inf -9.373764907941219970e-01 -inf
inf nan -inf nan

So, I woul like to increase the precision into my CUDA kernel or python code to avoid these nanand inf values.

Here is the CUDA kernel code and calling part of my main code (I have commented the classical method with numpy inv function :

    # Create arrayFullCross_vec array
    arrayFullCross_vec = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))

    # Create arrayFullCross_vec array
    invCrossMatrix_gpu = np.zeros((dimBlocks*dimBlocks*integ_prec**2))
 
    # Create arrayFullCross_vec array
    invCrossMatrix = np.zeros((dimBlocks,dimBlocks,integ_prec,integ_prec))

    # Build observables covariance matrix
    arrayFullCross_vec = buildObsCovarianceMatrix4_vec(k_ref, mu_ref, ir)
    """        
    # Compute integrand from covariance matrix
    for r_p in range(integ_prec):
      for s_p in range(integ_prec):
        # original version (without GPU)
        invCrossMatrix[:,:,r_p,s_p] = np.linalg.inv(arrayFullCross_vec[:,:,r_p,s_p])
    """
    # GPU version
    invCrossMatrix_gpu = gpuinv4x4(arrayFullCross_vec.flatten(),integ_prec**2)
    invCrossMatrix = invCrossMatrix_gpu.reshape(dimBlocks,dimBlocks,integ_prec,integ_prec)
    """

and here the CUDA kernel code and gpuinv4x4 function :

kernel = SourceModule("""

__device__ unsigned getoff(unsigned &off){
  unsigned ret = off & 0x0F;
  off = off >> 4;
  return ret;
}

const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable i.e. out == in)
// T = double or double only
typedef double T;
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){

  __shared__ T si[block_size];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n*16){
    si[threadIdx.x] = in[idx];
    unsigned lane = threadIdx.x & 15;
    unsigned sibase = threadIdx.x & 0x03F0;
    __syncwarp();
    unsigned off = pat[lane];
    T a,b;
    a  = si[sibase + getoff(off)];
    a *= si[sibase + getoff(off)];
    a *= si[sibase + getoff(off)];
    if (!getoff(off)) a = -a;
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    off = pat[lane+16];
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    off = pat[lane+32];
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    T det = si[sibase + (lane>>2)]*a;
    det += __shfl_down_sync(tmsk, det, 4, 16); // first add
    det += __shfl_down_sync(tmsk, det, 8, 16); // second add
    det =  __shfl_sync(tmsk, det, 0, 16); // broadcast
    out[idx] = a / det;
  }
}
""")

# python function for inverting 4x4 matrices
# n should be an even number
def gpuinv4x4(inp, n):
    # internal constants not to be modified
    hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
    # Convert parameters into numpy array
    # float32 
    """
    inpd = np.array(inp, dtype=np.float32)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype= np.float32)
    """
    # float64
    """
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype= np.float64)
    """
    # float128
    inpd = np.array(inp, dtype=np.float128)
    hpatd = np.array(hpat, dtype=np.uint32)
    output = np.empty((n*16), dtype= np.float128)
    # Get kernel function
    matinv4x4 = kernel.get_function("inv4x4")
    # Define block, grid and compute
    blockDim = (256,1,1) # do not change
    gridDim = ((n/16)+1,1,1)
    # Kernel function
    matinv4x4 (
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output

As you can see, I tried to increase accuracy of inverting operation by replacing np.float32 by np.float64 or np.float128 but problem remains.

I have also replaced typedef float T; by typedef double T;but without success.

How can I perform the right inversion of these matrices and mostly avoid the 'nan' and 'inf' values ? I think this is a real issue of precision but I can't find how to circumvent this problem.

it looks like you have an ugly matrix, possibly non-convertible. Even the results you are supposed to get look really bad. — Ander Biguri, Mar 27 '19 at 21:39
@AnderBiguri . I have updated my post by showing the 3x3 matrix to invert. the function `np.linalg.inv` computes a right inversion but PyCuda doesn't produce the same result. So maybe my 3x3 matrix to invert is ugly but `np.linalg.inv` manages to invert it. If you had some advices to avoid this issue ... ? regards — , Mar 27 '19 at 23:06
If have honestly not even a slight idea what your CUDA code does. Does it at least produce correct results for matrices with low(er) condition numbers? And please decide whether you want to invert a 3x3 or a 4x4 matrix. I would say you do the latter one. — BlameTheBits, Mar 27 '19 at 23:47
@Shadow . I just want to get a right computation for the matrix that I cited at the begin of the post. You can see that `np.linalg.inv' gives a right calculation unlikely to my gpu Matrix 4x4 inverse function (using kernel code). I want to invert 4x4 matrix. Thanks — , Mar 27 '19 at 23:55
Ah, I just found out that this question has a history. You should mention that. And now as you have stated that you want to invert 4x4 matrices you should edit you question accordingly. Also, imho you should really think about how important the speed advantage of Robert Crovellas code is to you. I would suggest you to implement your own kernel that you fully understand, even if it means to do those inversions slower. (512 inversions really is no high number anyway) — BlameTheBits, Mar 28 '19 at 00:14
according to my testing, the only thing that is needed is to properly convert the code to use `double` instead of `float`. — Robert Crovella, Mar 28 '19 at 00:38
@RobertCrovella . Numpy in host code doesn't have `double` type . I have already replaced `float` by `double` into kernel code. What do you mean by "properly convert the code to use `double` instead of `float`" ? I mean, what are the other modifications I have to do ? — , Mar 28 '19 at 00:47
The entire premise of this question is mathematically illiterate. You are getting overflow/underflow because of the magnitude of the matrix entries. Just normalize the matrix by its largest entry, then invert, then re-scale the resulting inverse matrix by the same factor if it is required — talonmies, Mar 29 '19 at 07:14

score 4 · Accepted Answer · answered Mar 28 '19 at 01:01

This question has previous related questions here and (to a lesser extent) here. It's unclear to me why the title of the question refers to 3x3, the bolded text in the question refers to to 3x3, but the presented problem is a 4x4 matrix inverse (and as stated to OP previously, this code can only be used to invert 4x4 matrices). I will proceed under the assumption that the example case is the desired case.

According to my testing, the only thing that is necessary is to take the previous answer code and convert it to use double (or, in pycuda, float64) rather than float (or in pycuda, float32). I think this should be evident because the example matrix values exceed the range of a float32 type. Here is a worked example:

$ cat t10.py
import numpy as np
# import matplotlib.pyplot as plt
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""

__device__ unsigned getoff(unsigned &off){
  unsigned ret = off & 0x0F;
  off = off >> 4;
  return ret;
}

const int block_size = 256;
const unsigned tmsk = 0xFFFFFFFF;
// in-place is acceptable i.e. out == in)
// T = float or double only
typedef double T;  // *** change this typedef to convert between float and double
__global__ void inv4x4(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){

  __shared__ T si[block_size];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < n*16){
    si[threadIdx.x] = in[idx];
    unsigned lane = threadIdx.x & 15;
    unsigned sibase = threadIdx.x & 0x03F0;
    __syncwarp();
    unsigned off = pat[lane];
    T a,b;
    a  = si[sibase + getoff(off)];
    a *= si[sibase + getoff(off)];
    a *= si[sibase + getoff(off)];
    if (!getoff(off)) a = -a;
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    off = pat[lane+16];
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    off = pat[lane+32];
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    b  = si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    b *= si[sibase + getoff(off)];
    if (getoff(off)) a += b;
    else a -=b;
    T det = si[sibase + (lane>>2)]*a;
    det += __shfl_down_sync(tmsk, det, 4, 16); // first add
    det += __shfl_down_sync(tmsk, det, 8, 16); // second add
    det =  __shfl_sync(tmsk, det, 0, 16); // broadcast
    out[idx] = a / det;
  }
}

""")
# host code
def gpuinv4x4(inp, n):
    # internal constants not to be modified
    hpat = ( 0x0EB51FA5, 0x1EB10FA1, 0x0E711F61, 0x1A710B61, 0x1EB40FA4, 0x0EB01FA0, 0x1E700F60, 0x0A701B60, 0x0DB41F94, 0x1DB00F90, 0x0D701F50, 0x19700B50, 0x1DA40E94, 0x0DA01E90, 0x1D600E50, 0x09601A50, 0x1E790F69, 0x0E391F29, 0x1E350F25, 0x0A351B25, 0x0E781F68, 0x1E380F28, 0x0E341F24, 0x1A340B24, 0x1D780F58, 0x0D381F18, 0x1D340F14, 0x09341B14, 0x0D681E58, 0x1D280E18, 0x0D241E14, 0x19240A14, 0x0A7D1B6D, 0x1A3D0B2D, 0x063D172D, 0x16390729, 0x1A7C0B6C, 0x0A3C1B2C, 0x163C072C, 0x06381728, 0x097C1B5C, 0x193C0B1C, 0x053C171C, 0x15380718, 0x196C0A5C, 0x092C1A1C, 0x152C061C, 0x05281618)
    # Convert parameters into numpy array
    # *** change next line between float32 and float64 to match float or double
    inpd = np.array(inp, dtype=np.float64)
    hpatd = np.array(hpat, dtype=np.uint32)
    # *** change next line between float32 and float64 to match float or double
    output = np.empty((n*16), dtype= np.float64)
    # Get kernel function
    matinv4x4 = kernel.get_function("inv4x4")
    # Define block, grid and compute
    blockDim = (256,1,1) # do not change
    gridDim = ((n/16)+1,1,1)
    # Kernel function
    matinv4x4 (
        cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
        block=blockDim, grid=gridDim)
    return output

inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 3.0, 1.0, 0.0, 1.0, 0.0, 2.0, 1.0, 2.120771107884677649e+09, 0.0, 0.0, 0.0, 0.0, 3.557266600921528288e+27, 3.557266600921528041e+07, 3.557266600921528320e+17, 0.0, 3.557266600921528041e+07, 3.557266600921528288e+27, 3.557266600921528041e+07, 0.0, 3.557266600921528320e+17, 3.557266600921528041e+07, 1.778633300460764144e+27)
n = 2
result = gpuinv4x4(inp, n)
print(result)
$ python t10.py
[ -3.00000000e+00  -5.00000000e-01   1.50000000e+00   1.00000000e+00
   1.00000000e+00   2.50000000e-01  -2.50000000e-01  -5.00000000e-01
   3.00000000e+00   2.50000000e-01  -1.25000000e+00  -5.00000000e-01
  -3.00000000e+00  -0.00000000e+00   1.00000000e+00   1.00000000e+00
   4.71526605e-10   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00   2.81114719e-28  -2.81114719e-48  -5.62229437e-38
   0.00000000e+00  -2.81114719e-48   2.81114719e-28  -5.62229437e-48
   0.00000000e+00  -5.62229437e-38  -5.62229437e-48   5.62229437e-28]
$

Apart from updating the input matrix to use the one supplied in this question, note that only 3 lines of code were changed from the previous answer to convert from 32-bit floating point to 64-bit floating point. Each of those 3 lines are marked by a comment that contains triple-asterisk (***) in it.

Also, if you are concerned about accuracy beyond the first 9 or so digits displayed here, I haven't investigated that. It may be that this code is not numerically suitable for every case or need.

As an aside, the code in the question also shows a float128 type in the python section. There is no native CUDA type that corresponds to that.

How to perform PyCUDA 4x4 matrix inversion with same accuracy than numpy linalg "inv" or "pinv" function

1 Answers1

Linked