6

I'm writing a library with some basic NxN matrix functionality and no dependencies; it's a bit of a learning project. I'm comparing my performance against Eigen. So far I've been roughly equal, and I've even beaten its performance on a couple of fronts with SSE2, and on quite a few fronts with AVX2 (Eigen only uses SSE2, so that's not too surprising).

My issue: I'm using Gaussian elimination to reduce a copy of the matrix to upper triangular form, then multiplying the diagonal entries to get the determinant. I beat Eigen for N < 300, but beyond that Eigen blows me away, and it only gets worse as the matrices get bigger. Given that all the memory is accessed sequentially and the compiler's disassembly doesn't look terrible, I don't think it's an optimization issue.

There is more optimization that could be done, but the timings look much more like an algorithmic complexity issue, or like there's a major SIMD advantage I'm not seeing. Simply unrolling the loops a bit hasn't done much for me.

Is there a better algorithm for calculating determinants?

Scalar code

/*
    Warning: Creates Temporaries!
*/
template<typename T, int ROW, int COLUMN> MML_INLINE T matrix<T, ROW, COLUMN>::determinant(void) const
{
    /*
    This method assumes square matrix
    */
    assert(row() == col());
    /*
    We need to create a temporary
    */
    matrix<T, ROW, COLUMN> temp(*this);
    /*We convert the temporary to upper triangular form*/
    uint N = row();
    T det = T(1);
    for (uint c = 0; c < N; ++c)
    {
        det = det * temp(c, c);
        for (uint r = c + 1; r < N; ++r)
        {
            T ratio = temp(r, c) / temp(c, c);
            for (uint k = c; k < N; k++)
            {
                temp(r, k) = temp(r, k) - ratio * temp(c, k);
            }
        }
    }

    return det;
}

AVX2

template<> float matrix<float>::determinant(void) const
{
    /*
    This method assumes square matrix
    */
    assert(row() == col());
    /*
    We need to create a temporary
    */
    matrix<float> temp(*this);
    /*We convert the temporary to upper triangular form*/
    float det = 1.0f;

    const uint N = row();
    const uint Nm8 = (N >= 8) ? N - 8 : 0; /* avoid unsigned wrap-around for small N */
    const uint Nm4 = (N >= 4) ? N - 4 : 0;

    uint c = 0;
    for (; c < Nm8; ++c)
    {
        det *= temp(c, c);
        float8 Diagonal = _mm256_set1_ps(temp(c, c));

        for (uint r = c + 1; r < N; ++r)
        {
            float8 ratio1 = _mm256_div_ps(_mm256_set1_ps(temp(r,c)), Diagonal);

            uint k = c + 1;
            for (; k < Nm8; k += 8)
            {
                float8 ref = _mm256_loadu_ps(temp._v + c*N + k);
                float8 r0 = _mm256_loadu_ps(temp._v + r*N + k);

                _mm256_storeu_ps(temp._v + r*N + k, _mm256_fnmadd_ps(ratio1, ref, r0)); /* r0 - ratio1*ref, matching the scalar update */
            }

            /*We go Scalar for the last few elements to handle non-multiples of 8*/
            for (; k < N; ++k)
            {
                _mm_store_ss(temp._v + index(r, k), _mm_sub_ss(_mm_set_ss(temp(r, k)), _mm_mul_ss(_mm256_castps256_ps128(ratio1),_mm_set_ss(temp(c, k)))));
            }
        }
    }

    for (; c < Nm4; ++c)
    {
        det *= temp(c, c);
        float4 Diagonal = _mm_set1_ps(temp(c, c));

        for (uint r = c + 1; r < N; ++r)
        {
            float4 ratio = _mm_div_ps(_mm_set1_ps(temp[r*N + c]), Diagonal);
            uint k = c + 1;
            for (; k < Nm4; k += 4)
            {
                float4 ref = _mm_loadu_ps(temp._v + c*N + k);
                float4 r0 = _mm_loadu_ps(temp._v + r*N + k);

                _mm_storeu_ps(temp._v + r*N + k, _mm_sub_ps(r0, _mm_mul_ps(ref, ratio)));
            }

            float fratio = _mm_cvtss_f32(ratio);
            for (; k < N; ++k)
            {
                temp(r, k) = temp(r, k) - fratio*temp(c, k);
            }
        }
    }

    for (; c < N; ++c)
    {
        det *= temp(c, c);
        float Diagonal = temp(c, c);
        for (uint r = c + 1; r < N; ++r)
        {
            float ratio = temp[r*N + c] / Diagonal;
            for (uint k = c + 1; k < N; ++k)
            {
                temp(r, k) = temp(r, k) - ratio*temp(c, k);
            }
        }
    }

    return det;
}
Chase R Lewis
    Given that you're clearly ready to dig into details, and Eigen is open source...[why not look at what it does](https://github.com/RLovelett/eigen/search?utf8=%E2%9C%93&q=determinant) or step through it...? – HostileFork says dont trust SE Mar 28 '16 at 00:37
  • Their method doesn't make much sense to me, which makes it difficult to adapt to the way my library operates. If I understood the mathematical reasoning behind it I'd be able to adapt it easily enough. I think it has to do with partial pivoting, which I'm starting to look into. The other methods have made sense to me, but this is the first where I haven't been able to understand the methodology behind it. In general, I'm just curious whether another brain has an idea of 'the best way'. Even after posting this question I'm still very much looking into it, and will post my code when I find a better way. – Chase R Lewis Mar 28 '16 at 01:48
  • [this](http://stackoverflow.com/questions/22479258/cholesky-decomposition-with-openmp) might be interesting to you. – Z boson Mar 28 '16 at 08:12
  • I think you need to do pivoting to deal with matrices like (0 1; 1 0) that have determinant -1, but on which your method will fail, I think. – dmuir Mar 28 '16 at 11:01
  • Yeah it will, I was working on figuring out what algorithm I wanted to use before doing that. – Chase R Lewis Mar 29 '16 at 04:13

2 Answers

5

Algorithms to reduce an n by n matrix to upper (or lower) triangular form by Gaussian elimination generally have complexity of O(n^3) (where ^ represents "to power of").

There are alternative approaches for computing the determinant, such as evaluating the set of eigenvalues (the determinant of a square matrix is equal to the product of its eigenvalues). For general matrices, computation of the complete set of eigenvalues is also, in practice, O(n^3).
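For reference, the identity in play here, for an n-by-n matrix A with eigenvalues λ_1, ..., λ_n, is

\det(A) = \prod_{i=1}^{n} \lambda_i

For an upper triangular matrix the two views coincide: its eigenvalues are its diagonal entries, so its determinant is the product of the diagonal. Since adding a multiple of one row to another leaves the determinant unchanged, that product also equals the determinant of the original matrix (with a sign flip for every row swap once pivoting is added).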

In theory, however, calculation of the set of eigenvalues has complexity O(n^w), where w is between 2 and 2.376, which means that for (much) larger matrices it will be faster than Gaussian elimination. Have a look at the article "Fast linear algebra is stable" by James Demmel, Ioana Dumitriu, and Olga Holtz in Numerische Mathematik, Volume 108, Issue 1, pp. 59-91, November 2007. If Eigen uses an approach with complexity less than O(n^3) for larger matrices (I don't know, never having had reason to investigate such things), that would explain your observations.

Peter
  • Thanks, that paper is interesting. I think 'Block LU Factorization' is the method Eigen uses, with 8x8 sub-blocks. That paper says the time complexity is about O(n^2.5). I will look more into the eigenvalue option also. – Chase R Lewis Mar 28 '16 at 03:28
1

The approach most sources seem to use is Block LU factorization, which creates a lower triangular and an upper triangular matrix in the same memory space. It is reportedly ~O(n^2.5), depending on the block size you use.

Here is a PowerPoint from Rice University that explains the algorithm:

www.caam.rice.edu/~timwar/MA471F03/Lecture24.ppt

Division by a matrix means multiplication by its inverse.

The idea seems to be to significantly increase the number of O(n^2)-style operations while shrinking the cubic work down to small m x m blocks; since m is a fixed, small block size, this in effect lowers the cost of the algorithm.

It's going to take a little while to write this up efficiently, since doing it well requires 'in place' algorithms I haven't written yet.
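In the meantime, a minimal unblocked sketch of the pivoted elimination that a blocked version builds on might look like the following. This is plain scalar C++ over a row-major std::vector rather than this library's matrix type, and determinant_lu is just an illustrative name; the point is the pivot search and the sign flip per row swap that the comments above note are missing from straight Gaussian elimination.

#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

/* Determinant via in-place LU-style elimination with partial pivoting.
   'a' holds an N x N matrix in row-major order and is factored in place
   (it is taken by value, so the caller's data is untouched). */
double determinant_lu(std::vector<double> a, std::size_t N)
{
    assert(a.size() == N * N);
    double det = 1.0;

    for (std::size_t c = 0; c < N; ++c)
    {
        /* Partial pivoting: pick the largest |entry| in column c on or below the diagonal. */
        std::size_t pivot = c;
        for (std::size_t r = c + 1; r < N; ++r)
            if (std::fabs(a[r * N + c]) > std::fabs(a[pivot * N + c]))
                pivot = r;

        if (a[pivot * N + c] == 0.0)
            return 0.0; /* whole column is zero: the matrix is singular */

        if (pivot != c)
        {
            for (std::size_t k = 0; k < N; ++k)
                std::swap(a[c * N + k], a[pivot * N + k]);
            det = -det; /* each row swap negates the determinant */
        }

        det *= a[c * N + c];

        /* Eliminate below the pivot. In a blocked version this trailing update
           is what gets batched into cache-friendly matrix-matrix operations. */
        for (std::size_t r = c + 1; r < N; ++r)
        {
            double ratio = a[r * N + c] / a[c * N + c];
            for (std::size_t k = c + 1; k < N; ++k)
                a[r * N + k] -= ratio * a[c * N + k];
        }
    }
    return det;
}

A blocked variant factors an m-column panel with exactly this pivoted elimination, applies the accumulated row swaps and a triangular solve to the columns to its right, and then updates the trailing (N-m) x (N-m) submatrix with a matrix-matrix product; that last step is where most of the work, and the cache- and SIMD-friendly part of it, ends up.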

Chase R Lewis