4

Sorry if this is a duplicate. Here is the problem: numpy (1.18.2) under Python 3.8.2 computes a matrix product about 3 times faster than GNU Fortran (9.2.0 MinGW.org GCC Build-20200227-1) under Windows. I compiled with the command gfortran.exe test.f without any additional options.

Does anyone know what is causing this, and is it possible to increase the computation speed in Fortran?

Here is the Fortran code:

program product_test
    implicit none
    INTEGER :: N, N_count, nc
    REAL*8 :: t1, t2
    REAL*8, dimension (:,:), allocatable :: a, b, c

    N = 200
    N_count = 10

    allocate ( a(N,N) )
    allocate ( b(N,N) )
    allocate ( c(N,N) ) 

    call RANDOM_NUMBER(a)
    call RANDOM_NUMBER(b)

    print *, 'Matrix Multiplication: C = A * B for size (',N,',',N,')'
    call CPU_TIME ( t1 )
    do nc=1,N_count
        c = MATMUL(a,b)
    end do
    call CPU_TIME ( t2 )
    t2 = (t2 - t1)/N_count
    print *, 'Time of operation was ', t2, ' seconds'

end

Here is the output:

Matrix Multiplication: C = A * B for size ( 200 , 200 )
Time of operation was 9.3749E-003 seconds

Here is the python 3 code:

import numpy as np
import time

N = 200
N_count = 10

a = np.random.rand(N,N)
b = np.random.rand(N,N)
c = np.zeros([N,N], dtype = float)


print('Matrix product in python (using numpy): c= a*b for size (',N,',',N,')')
start_time = time.time()
for nc in range(N_count):
    c = a@b
t2 = (time.time() - start_time)/N_count
print('Elapsed time = ',t2,'s')

Here is the output:

Matrix product in python (using numpy): c= a*b for size ( 200 , 200 )
Elapsed time = 0.0031252 s


**Additional tests.** Following the comments of "roygvib" and "Vladimir F", I ran the test with BLAS/LAPACK:

gfortran test.f -lopenblas -o test.exe, gfortran test.f -ffast-math -o test.exe, gfortran test.f -lblas -o test.exe, and gfortran test.f -llapack -o test.exe all give me a calculation time of about 0.0063 s for the multiplication of square matrices of size (200 x 200).

Unfortunately, I had deleted the previous version of MinGW, so the new tests were performed under GNU Fortran (x86_64-posix-seh-rev0, built by the MinGW-W64 project, 8.1.0). Maybe I did something incorrectly, because there is no difference between -llapack, -lblas and -lopenblas. For timing I used SYSTEM_CLOCK, as suggested by "Vladimir F".
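
For reference, a minimal sketch of this kind of SYSTEM_CLOCK wall-clock timing (illustrative only, not my exact test program; the program name is a placeholder) is:

program wallclock_demo
    ! Hypothetical minimal example of wall-clock timing with SYSTEM_CLOCK;
    ! not the exact test program used for the numbers above.
    implicit none
    integer :: count_start, count_end, count_rate
    real*8, dimension(:,:), allocatable :: a, b, c

    allocate ( a(200,200), b(200,200), c(200,200) )
    call random_number(a)
    call random_number(b)

    call system_clock(count_start, count_rate)
    c = matmul(a, b)
    call system_clock(count_end)

    ! wall-clock seconds = elapsed ticks / ticks per second
    print *, 'Wall time: ', dble(count_end - count_start) / dble(count_rate), ' seconds'
end program wallclock_demo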

Now it is better, but numpy is still faster than Fortran (not three times, but two times). Following the last comment of "Vladimir F", I found that, unlike Python, Fortran mainly uses one logical core (there are 4 logical cores on my PC with an Intel i3 CPU). Thus, this looks like a problem of improperly configured MinGW on my PC (Windows 8.1).

[Screenshot: CPU usage showing only one logical core busy during the Fortran run]

Alexander Korovin
  • I guess np.dot may be using multi-cores on your machine (e.g., it uses 4 cores on my mac). This can be checked by changing N_count to 10000 and looking at htop etc. Also, gfortran has an option "-fexternal-blas", which may be nice when used with Openblas etc. – roygvib Apr 08 '20 at 02:48
  • I reopened this. The proposed dupe was nothing of the sort - it concerned a miscalculation of speedup arising from an ill-parenthesised expression. I'm sure there is a dupe, but haven't looked for it (yet). – High Performance Mark Apr 08 '20 at 06:31
  • "roygvib", thank you for helpful comment. Unfortunately, the option `-fexternal-blas` in `gfortran` give me an error, I think this is due to the blas search problem. – Alexander Korovin Apr 08 '20 at 07:17
  • You can't say "an error", you must describe it. – Vladimir F Героям слава Apr 08 '20 at 07:39
  • `gfortran test.f -fexternal-blas` on my windows system gives me an error: "undefined reference to `dgemm_'" – Alexander Korovin Apr 08 '20 at 10:26
  • In your new test.f you do what exactly? MATMUL or DGEMM? Linking any -lblas or -lopenblas will do exactly nothing unless you also use -fexternal-blas or call DGEMM explicitly. Do not forget -O3. You say *"Fortran uses all 4 cores of my PC"*. How do you determine that? How do you control that? – Vladimir F Героям слава Apr 08 '20 at 14:05
  • Thank you, "Vladimir F". "roygvib" is right. When I cleared the memory of my PC, I found that only one core (logical) was working. Thus, this is a problem of improperly configured MinGW on my PC. I did not use `dgemm`, maybe `matmul` can use this as you wrote. But `dgemm` is a part of Intel MKL, and I do not know how to add it into MinGW. – Alexander Korovin Apr 08 '20 at 15:17
  • From your screenshot I conclude that only one core is used by Fortran. Note that Fortran should not magically use all cores, you have to tell it to do so. There are ways to run OpenBLAS thread, for example, but you have to tell it to do so. This has **nothing** to do with MinGW configuration, you must configure the external library. Also, **DGEMM** is NOT just part of Intel MKL, it **is a part of BLAS**. The way to "add it" is to use `-lblas` or `-lopenblas`. It might be the time you opened the documentation of the software you are using. – Vladimir F Героям слава Apr 08 '20 at 15:19
  • Simply just call DGEMM, link `-lopenblas` and test. If it is not multithreaded yet, set the number of threads for OpenBLAS according to the manual https://fossies.org/linux/OpenBLAS/USAGE.md. – Vladimir F Героям слава Apr 08 '20 at 15:22

2 Answers

4

Use MATMUL or external libraries like BLAS for matrix multiplication in Fortran. We have many questions that deal with the performance of matrix multiplication:

  • Fortran matrix multiplication performance in different optimization
  • performance of fortran matrix operations
  • How does BLAS get such extreme performance?

You should read them first. You should never do matrix multiplication in a naive loop; that will always be slow. There are special algorithms for matrix multiplication that use the memory bandwidth efficiently and also employ vector instructions (often written directly in assembly).
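
For illustration, the kind of naive triple loop meant here is sketched below (a hypothetical example only, to show what not to benchmark with; MATMUL or a tuned BLAS will be far faster):

program naive_matmul_demo
    ! Hypothetical illustration of the naive triple loop warned against above;
    ! it makes poor use of cache and vector units.
    implicit none
    integer, parameter :: n = 200
    integer :: i, j, k
    real*8, dimension(:,:), allocatable :: a, b, c

    allocate ( a(n,n), b(n,n), c(n,n) )
    call random_number(a)
    call random_number(b)

    do i = 1, n
        do j = 1, n
            c(i, j) = 0.0d0
            do k = 1, n
                c(i, j) = c(i, j) + a(i, k) * b(k, j)
            end do
        end do
    end do

    print *, 'c(1,1) = ', c(1,1)
end program naive_matmul_demo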


Many Fortran compilers will allow you to call BLAS xGEMM directly through MATMUL. In gfortran it is possible with the -fexternal-blas option mentioned by roygvib. If you have problems with that, call DGEMM directly.
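
For example, a minimal sketch of an explicit DGEMM call might look like this (the program name is just a placeholder; it assumes some BLAS library is linked, e.g. with -lblas or -lopenblas):

program dgemm_demo
    ! Hypothetical minimal example of calling the BLAS routine DGEMM directly
    ! instead of MATMUL; link a BLAS implementation when compiling.
    implicit none
    integer, parameter :: n = 200
    real*8, dimension(:,:), allocatable :: a, b, c
    external :: dgemm

    allocate ( a(n,n), b(n,n), c(n,n) )
    call random_number(a)
    call random_number(b)

    ! DGEMM computes C := alpha*op(A)*op(B) + beta*C; here alpha = 1, beta = 0,
    ! no transposition ('N','N'), so C = A*B.
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

    print *, 'c(1,1) = ', c(1,1)
end program dgemm_demo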

Certain BLAS implementations are able to use multiple threads. If you try that, you must not use CPU_TIME to measure the speed; you have to use SYSTEM_CLOCK or an alternative.


Also, you did not report using any optimization flags like -O3. These are necessary for any decent performance unless an optimized external library does all the work.

  • Perhaps worth adding that this is how numpy and all the other super-fast (ha!) dynamic languages achieve their miraculous speed-ups on matrix multiplication when compared with home-rolled old Fortran - they call BLAS. – High Performance Mark Apr 08 '20 at 06:33
  • I'm sorry. My question was about comparing numpy and matmul. To avoid misunderstandings, I deleted the part of the code (the direct algorithm with for/do loops) that was not relevant to my question. – Alexander Korovin Apr 08 '20 at 06:38
  • In addition, I did not find any effect of the optimization option “-O3” in “gfortran”: “gfortran -O3 test.f” and “gfortran test.f” give me the same speed. Could you please clarify your opinion about this flag? – Alexander Korovin Apr 08 '20 at 07:01
  • *"These are necessary for any decent performance **unless an optimized external library does all the work**."* MATMUL is an external library. – Vladimir F Героям слава Apr 08 '20 at 07:37
  • Check, as has been hinted in the comments, that you are using the same number of cores in NumPy and in Fortran. Be aware that even if you tried to use more cores in Fortran using a threaded BLAS library, you **cannot use CPU_TIME** to measure the speed! https://stackoverflow.com/questions/25465101/fortran-openmp-program-shows-no-speedup-of-cpu-time – Vladimir F Героям слава Apr 08 '20 at 07:40
1

The problem was probably caused by incompatible versions of the different components. I updated the compiler and libraries (upgrading to gcc 9.3.0 and OpenBLAS 0.3.9 after uninstalling all previous versions).

Now the following results for the matrix product c = a * b with matrix size (2000 x 2000), averaged over 20 trials, look more reasonable (I carried out the tests on a PC with an Intel i5 (4 logical cores) under Windows 10):

  1. 0.237833 s (minGW64) and 0.236853 s (cygwin64): C++ with Armadillo, using gcc 9.3.0 + OpenBLAS 0.3.9.
  2. 0.2492 s (minGW64) and 0.2479 s (cygwin64), norm = 0: Fortran (matmul) with the -fexternal-blas flag, command line: gfortran FILE_NAME.f95 -o FILE_NAME -O3 -ffast-math -fexternal-blas "[pathto]\libopenblas_v0.3.9-gcc_9_3_0.a" (gcc 9.3.0, OpenBLAS 0.3.9).
     0.2484 s with dgemm, whereas 1.12894 s with matmul, norm = 1.5695E-10: Fortran in minGW64 with the -lopenblas flag, command line: gfortran FILE_NAME.f95 -o FILE_NAME -O3 -ffast-math -lopenblas (gcc 9.3.0).
  3. 0.2562533 s, norm = 0.0: Python (numpy).
  4. 0.285133 s (R2016a) and 0.269926 s (R2020a), norm = 8.4623e-12: Matlab (64-bit).
  5. 0.3133 s, norm = 1.5695E-10: Fortran (matmul) in minGW64/cygwin64 with the -lblas flag, command line: gfortran FILE_NAME.f95 -o FILE_NAME -O3 -ffast-math -lblas (gcc 9.3.0, in cygwin64).

To run these tests, I used cygwin (or minGW) to compile the C++ code with Armadillo (OpenMP C++ Matrix Multiplication run slower in parallel), where the three matrices A, B, C were created and saved to disk so that the same matrices could be used in all tests. The "norm" therefore indicates the relative accuracy of the matrix product. I found that numpy uses OpenBLAS (libopenblas.PYQHXLVVQ7VESDPUVUADXEVJOBGHJPAY.gfortran-win_amd64). Matlab on my PC reports the following BLAS/LAPACK library: Intel(R) Math Kernel Library version 11.2.3 Build 20150413 for applications with Intel(R) 64 architecture, CNR AVX2 branch, in R2016a, and Intel(R) Math Kernel Library Version 2019.0.3 Product Build 20190125 for Intel(R) 64 architecture applications, CNR branch AVX2, in R2020a.
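
The exact formula behind this "norm" is not given in the post; a hypothetical sketch of one way to compute such a relative difference (here comparing MATMUL against DGEMM as the reference, assuming a BLAS library is linked) is:

program norm_demo
    ! Hypothetical sketch of a relative-difference check; the exact "norm"
    ! used for the numbers above is not specified in the post.
    implicit none
    integer, parameter :: n = 200
    real*8, dimension(:,:), allocatable :: a, b, c, c_ref
    external :: dgemm

    allocate ( a(n,n), b(n,n), c(n,n), c_ref(n,n) )
    call random_number(a)
    call random_number(b)

    c = matmul(a, b)                                              ! product under test
    call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c_ref, n)  ! reference product

    ! relative maximum elementwise difference
    print *, 'norm = ', maxval(abs(c - c_ref)) / maxval(abs(c_ref))
end program norm_demo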

The Fortran speed is now reasonable compared to the other languages, and OpenBLAS came out fastest from C++ (perhaps due to its adaptation for C). Note that Matlab shows relatively high computational speed without fully using the CPU. All languages/programs use all 4 logical cores of my system:

[Screenshot: CPU usage showing all 4 logical cores busy during the tests]

Alexander Korovin