
I tried to rewrite Fortran code in C++, implementing a 2000×2000 matrix multiplication through the Eigen library. I found that the explicit for loop in Eigen is much slower (>3x) than the do loop in Fortran. The codes are listed below:

test.f90

program main
implicit none
integer :: n,i,j,k
integer :: tic,toc
real(8),ALLOCATABLE ::a(:,:),b(:,:),c(:,:)
real(8) :: s

n = 2000
allocate(a(n,n),b(n,n),c(n,n))
do i=1,n
    do j =1,n
        a(j,i) = i * 1.0
        b(j,i) = i * 1.0
    enddo
enddo

call system_clock(tic)
do j=1,n
    do i=1,n
        s = 0.0
        do k=1,n
            s = s + a(i,k) * b(k,j)
        enddo
        c(i,j) = s
    enddo
enddo
call system_clock(toc)
print*,'Fortran with loop:', (toc - tic) / 1000.0

call system_clock(tic)
c = matmul(a,b)
call system_clock(toc)
print*,'Fortran with matmul:', (toc - tic) / 1000.0


DEALLOCATE(a,b,c)
end

test.cpp

#include<Eigen/Core>
#include<time.h>
#include<iostream>
using Eigen::MatrixXd;

int main(){
    int n = 2000;
    MatrixXd a(n,n),b(n,n),c(n,n);
    for(int i=0;i<n;i++){
    for(int j=0;j<n;j++){
            a(i,j) = i * 1.0;
            b(i,j) = j * 1.0;
        }
    }
    clock_t tic,toc;
    tic = clock();
    for(int j=0;j<n;j++){
        for(int i=0;i<n;i++){
            double s= 0.0;
            for(int k=0;k<n;k++){
                s += a(i,k) * b(k,j);
            }
            c(i,j) = s;
        }
    }
    toc = clock();
    std::cout << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;

    tic = clock();
    c=  a * b;
    toc = clock();
    std::cout << (double)((toc - tic)) / CLOCKS_PER_SEC << std::endl;
}

Compiled with (gcc 8.4, on Ubuntu 18.04):

gfortran test.f90 -O3 -march=native -o testf
g++ test.cpp -O3 -march=native -I/path/to/eigen -o testcpp 

And I get results:

Fortran with loop:   10.9700003
Fortran with matmul:   0.834999979
Eigen with loop: 38.2188
Eigen with *: 0.40625

The internal implementations are of comparable speed, but why is Eigen so much slower for the loop implementation?

Clock Su
  • I think in Fortran operator() is a native way to address memory, but in C++ it would be operator[]. The seemingly same operator() in Eigen returns a reference object to memory with overridden operator= and type() to modify and access it. This overhead is probably the reason why the loop implementation in Eigen is so slow compared to Fortran. – Semyon Burov Dec 19 '20 at 06:09
  • With gcc, try option `-ffast-math` – Damien Dec 19 '20 at 06:21
  • You're running release and not debug, right? – John Alexiou Dec 19 '20 at 06:27
  • 4
    -ffast-math is misnamed. It should be named -funsafe-math. At least with Fortran, it will break conforming Fortran program. – evets Dec 19 '20 at 07:26
  • 2
    @SemyonBurov Quite possibly (with a note that indexing is not an operator in Fortran), but foremost I do not see why anyone would compute matrix multiplication from a naive definition loop in Eigen. – Vladimir F Героям слава Dec 19 '20 at 10:26
  • 3
    The title is a bit misleading, if Eigen `Matrix * Matrix` is faster than Fortrans `matmul`. If you think you need to manually access lots of coefficients, you should consider building with `-DNDEBUG` (after being sure that the code runs as intended). – chtz Dec 19 '20 at 11:36
  • 2
    Note system_clock doesn't necessarily measure in milliseconds. Look into the rate argument – Ian Bush Dec 19 '20 at 13:51
  • 1
    Also note real( 8 ) isn't portable - https://stackoverflow.com/questions/838310/fortran-90-kind-parameter – Ian Bush Dec 19 '20 at 14:14
  • @chtz I rebuilt test.cpp with `-DNDEBUG`, but the results changed little. – Clock Su Dec 19 '20 at 14:46
  • @chtz There is no universal Fortran matmul, not even a single gfortran matmul. gfortran can either use its simple implementation or call an optimized BLAS library if told to do so. Other compilers can call whatever they want or are instructed to call. – Vladimir F Героям слава Dec 19 '20 at 14:49
  • 1
    Regarding the Fortran code: `real(8)` is nonstandard, use the intrinsic `iso_fortran_env` module and `real(real32)` or `real(real64)`. Furthermore, the arrays `a` and `b` are probably initialized with a loss of performance, because `i * 1.0` is single precision. (Just write `a(j,i) = i` instead.) Same for `1000.0`, which is single precision. – knia Feb 18 '21 at 08:39
  • The default storage order of Eigen matrices is column-major just like Fortran. See the answer by knia, where loop re-ordering gives a five-fold improvement. – IPribec Jan 29 '23 at 10:25

3 Answers


The biggest problem with the loops is that they are not done in the proper order for either C++ (which is conventionally row-major) or Fortran (which is column-major). This gives you a large performance hit, especially for large matrices.

The nativemul implementation by John Alexiou (with dot_product) has the same problem, so I am very surprised that he claims it's faster. (And I find that it isn't; see below. Maybe his (intel?) compiler rewrites the code to use matmul internally.)

This is the correct loop order for Fortran:

    c = 0
    do j=1,n
        do k=1,n
            do i=1,n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
        enddo
    enddo

With gfortran version 10.2.0, and compiled with -O3, I get

 Fortran with original OP's loop:        53.5190010
 Fortran with John Alexiou's nativemul:  53.4309998
 Fortran with correct loop:              11.0679998
 Fortran with matmul:                     2.3699999

A correct loop in C++ should give you similar performance.

Obviously matmul/BLAS are much faster for large matrices.

knia
  • Eigen matrices use column-major storage order by default. You can pick the storage order via a template parameter: https://eigen.tuxfamily.org/dox/group__TopicStorageOrders.html. This implies that the same order you suggest for the Fortran naive matmul should be used in the naive Eigen C++ version too. – IPribec Jan 29 '23 at 10:19

In the Fortran code I saw the same problem, but when I moved the matrix multiplication into a subroutine, the resulting speed was almost as good as matmul. I also compared it to the BLAS Level 3 function.

Fortran with loop:   9.220000
Fortran with matmul:   8.450000
Fortran with blas3:   2.050000

and the code to produce it

program ConsoleMatMul
use BLAS95
implicit none
integer :: n,i,j
integer :: tic,toc
real(8),ALLOCATABLE :: a(:,:),b(:,:),c(:,:),xe(:,:)

n = 2000
allocate(a(n,n),b(n,n),c(n,n),xe(n,n))
do i=1,n
    do j =1,n
        a(j,i) = i * 1.0
        b(j,i) = i * 1.0
    enddo
enddo

call system_clock(tic)
call nativemul(a,b,c)
call system_clock(toc)
print*,'Fortran with loop:', (toc - tic) / 1000.0

call system_clock(tic)
c = matmul(a,b)
call system_clock(toc)
print*,'Fortran with matmul:', (toc - tic) / 1000.0
c = b
xe = 0d0
call system_clock(tic)
call gemm(a,c,xe) ! BLAS MATRIX/MATRIX MUL
call system_clock(toc)
print*,'Fortran with blas3:', (toc - tic) / 1000.0

DEALLOCATE(a,b,c)

contains

pure subroutine nativemul(a,b,c)
real(8), intent(in), allocatable :: a(:,:), b(:,:)
real(8), intent(out), allocatable :: c(:,:)
real(8) :: s
integer :: n, i,j,k
    n = size(a,1)
    if (.not. allocated(c)) allocate(c(n,n))
    do j=1,n
        do i=1,n
            s = 0.0d0
            do k=1,n
                s = s + a(i,k) * b(k,j)
            end do
            c(i,j) = s
        end do
    end do
end subroutine    

end program ConsoleMatMul

Before I moved the code into a subroutine I got:

Fortran with loop:   85.450000

Update: the native multiplication reaches matmul levels (or exceeds them) when the inner loop is replaced by a dot_product().

pure subroutine nativemul(a,b,c)
real(8), intent(in) :: a(:,:), b(:,:)
real(8), intent(out) :: c(:,:)
integer :: n, i,j
    n = size(a,1)
    do j=1,n
        do i=1,n
            c(i,j) = dot_product(a(i,:),b(:,j))
            ! or  = sum(a(i,:)*b(:,j))
        end do
    end do
end subroutine    
John Alexiou
  • 3
    Why do you want `a` and `b` of `nativemul` allocatable? – francescalus Dec 19 '20 at 08:50
  • 1
    Note that you can have matmul done using blas. – Vladimir F Героям слава Dec 19 '20 at 10:24
  • 1
    The matrix `c` is always unallocated at the beginning of `nativemul`, because it is declared as `intent(out), allocatable`. Therefore the timing for `nativemul` ("Fortran with loop") includes the re-allocation. – knia Feb 18 '21 at 09:36
  • @A.Hennink - good point. Would `real(8), intent(inout), allocatable :: c(:,:)` fix the situation? – John Alexiou Feb 18 '21 at 13:13
  • Yes... but what you want is `intent(out)`, without `allocatable`. The "actual argument" (what you put in when the procedure is called) is allocatable, but the dummy argument (which is declared in the procedure) need not be. – knia Feb 18 '21 at 13:26

C++ pre-increment avoids the copy that post-increment implies (though for built-in loop counters an optimizing compiler typically generates identical code for both):

for(int j=0;j<n;++j){
    for(int i=0;i<n;++i){
        double s = 0.0;
        for(int k=0;k<n;++k){
            s += a(i,k) * b(k,j);
        }
        c(i,j) = s;
    }
}
PodHunter