
I'm doing some basic matrix multiplication using Armadillo, but for some reason it takes very long to complete. I'm quite new to C++, so I might be doing something wrong, but I can't see it even in this very basic example:

#include <armadillo>
#include <iostream>

using namespace arma;

int main(){
    arma::vec coefficients = {1.0, 1.09, 1.08};
    arma::mat X = arma::mat(100000, 3, fill::randu) * coefficients;

    cout << X.n_cols;
}

By very slow, I mean that I have run this example for several minutes and it doesn't finish.

EDIT

I ran the program with `perf stat ./main`, but stopped it after some time because it shouldn't take that long. This is the output:

^C./main: Interrupt

 Performance counter stats for './main':

        257,169.20 msec task-clock                #    1.003 CPUs utilized          
             3,342      context-switches          #   12.995 /sec                   
               215      cpu-migrations            #    0.836 /sec                   
             1,312      page-faults               #    5.102 /sec                   
   963,025,520,077      cycles                    #    3.745 GHz                    
   542,959,361,927      instructions              #    0.56  insn per cycle         
   113,002,342,332      branches                  #  439.409 M/sec                  
     1,095,168,312      branch-misses             #    0.97% of all branches        

     256.349026907 seconds time elapsed

     147.860947000 seconds user
     109.317743000 seconds sys
Alejandro Andrade
  • What does `fill::randu` cost? I've seen some really expensive random number generators. – user4581301 Nov 22 '21 at 23:29
  • @user4581301 that is not the problem if you create the matrix first (`arma::mat Y = arma::mat(100000, 3, fill::randu)`) it is done instantaneously. The problem is the multiplication – Alejandro Andrade Nov 22 '21 at 23:34
  • Groovy. Figured it was worth checking. – user4581301 Nov 22 '21 at 23:35
  • That seems very fast for me. How are you compiling it? – EdmCoff Nov 22 '21 at 23:48
  • I'm compiling with g++ in ubuntu 20.04 – Alejandro Andrade Nov 22 '21 at 23:49
  • What optimization level are you compiling with? Is it at least `-O2`? – Bob__ Nov 22 '21 at 23:53
  • Did you enable optimizations? Like `-O3`, or the "unsafe" ones like `-march=native` and `-ffast-math`. – Jérôme Richard Nov 22 '21 at 23:53
  • I think the full command you're compiling with would be useful. – EdmCoff Nov 22 '21 at 23:54
  • I'm using a very basic command to be honest. `g++ main.cpp -o main -larmadillo` – Alejandro Andrade Nov 22 '21 at 23:58
  • g++'s default optimization level is `-O0` or "optimization for compilation time". Try `g++ -O3 main.cpp -o main -larmadillo` – Bob__ Nov 23 '21 at 00:01
  • Even that is very fast with your example (on my completely different system). But I would try it with -O2 or -O3 as others have suggested. – EdmCoff Nov 23 '21 at 00:01
  • Seems like it might be my system :sob:. I tried with `-O2` and `-O3`, but that doesn't solve the problem. – Alejandro Andrade Nov 23 '21 at 00:04
  • This is weird. Can you report the result of calling `perf stat ./main` in your question? If you do not have the `perf` profiling tool yet, you need to [install it](https://askubuntu.com/questions/50145/how-to-install-perf-monitoring-tool). – Jérôme Richard Nov 23 '21 at 00:16
  • Just in case, add a `<< std::endl` after `std::cout << X.n_cols`. – Bob__ Nov 23 '21 at 00:16
  • @JérômeRichard took me a while to install perf. But I updated the question – Alejandro Andrade Nov 23 '21 at 12:46
  • Your program runtime seems to be weirdly split between userspace and the kernel. Can you run `strace ./main` to see what system calls it is doing? I would expect it not to be doing _any_ system calls after the startup. – Botje Nov 23 '21 at 15:05
  • did you check with a profiler? – Jepessen Nov 23 '21 at 15:17
  • One would hope that Armadillo stores the matrix with the rows contiguously, but just in case, flip the dimensions and multiply from the left? Also: is there a way to let Armadillo use optimized BLAS routines? Your matrix is far from square so it may not make much of a difference, but it doesn't hurt to try. – Victor Eijkhout Nov 23 '21 at 15:18
  • Looking up the documentation my suspicion pans out: the matrix is stored the wrong way around. Transpose it and do a transpose product. Better for cache, TLB, whatnot. – Victor Eijkhout Nov 23 '21 at 15:25
  • A transposition may definitely help, but I still find it weird that the program spends so much time doing branches (20% of the instructions, and typically most of the time on x86 due to branch-related instructions not being marked as such). TLB misses might explain this, but if so, the overhead of the TLB misses would be really huge. Moreover, on modern processors I do not expect many TLB misses, since there are only 3 columns and processors have a lot of TLB entries (and prefetchers support a lot of concurrent streams). – Jérôme Richard Nov 23 '21 at 18:37
  • [Here](https://stackoverflow.com/a/39025656/12939557) is a command to check whether the TLB is an issue or not. To better pinpoint the source of the problem, `perf record -e branches ./main` + `perf report` can be used (possibly with the option `--call-graph` for `perf report`). You may need to put 0 in the [`/proc/sys/kernel/perf_event_paranoid`](https://stackoverflow.com/questions/51911368/what-restriction-is-perf-event-paranoid-1-actually-putting-on-x86-perf) file so that the result of perf is useful. – Jérôme Richard Nov 23 '21 at 18:51
  • **How did you install armadillo?** These dimensions are small for armadillo and this code runs instantly on my notebook. Armadillo uses BLAS and LAPACK to handle operations. You need to link with them if you are using armadillo as a header-only library, or build the armadillo wrapper library, which in turn will link with them. Armadillo has its own implementation if these are not available, but it is slow and I suspect this is your problem. – darcamo Nov 24 '21 at 17:46
  • @darcamo I basically followed the instructions here http://codingadventures.org/2020/05/24/how-to-install-armadillo-library-in-ubuntu/ and on the armadillo webpage. But if you have any recommendations on how to reinstall it, they are more than welcome. Note that I downloaded the latest version and not the one mentioned in the article – Alejandro Andrade Nov 24 '21 at 20:44
  • Changing the compile command to `g++ main.cpp -o main -DARMA_DONT_USE_WRAPPER -larmadillo -llapack` solves the problem. I would gladly accept the answer of anyone who explains why. – Alejandro Andrade Nov 24 '21 at 21:45

1 Answer

Armadillo is a template-based library that can be used as a header-only library. Just include its header and make sure you link with some BLAS and LAPACK implementation. When used like this, armadillo assumes you have a BLAS and LAPACK implementation available. You will get link errors if you try to use any functionality in armadillo that requires them without linking with them. If you don't have BLAS and/or LAPACK, you can change the armadillo_bits/config.hpp file and comment out some defines there such that armadillo uses its own (slower) implementation of that functionality.

Alternatively, armadillo can be compiled as a wrapper library, in which case you just link with the "armadillo" wrapper library. Its CMake code tries to determine at configure time what you have available and "comments out the appropriate defines" if some requirement is missing, which in turn makes it use the slower implementation. That "configure" code is wrongly determining that you don't have BLAS available, and this matters because BLAS is the one providing fast matrix multiplication.

My suggestion is to just make sure you have BLAS and LAPACK installed and use armadillo as a header-only library, making sure to link your program with BLAS and LAPACK.

Another option is using the conan package manager to install armadillo. Conan added a recipe to install armadillo recently. It has the advantage that it will install everything that armadillo needs (it installs openblas, which provides both a BLAS and LAPACK implementation) and it is system agnostic (similar to virtual environments in Python).


Note

In the comments you mentioned that it worked with `g++ main.cpp -o main -DARMA_DONT_USE_WRAPPER -larmadillo -llapack`. The reason is that even if you installed the wrapper library, defining `ARMA_DONT_USE_WRAPPER` means you are effectively using armadillo as a header-only library. You can replace `-larmadillo -llapack` with `-lblas -llapack`.

darcamo
  • So just to be clear, by passing the arguments `-larmadillo -llapack` I'm linking armadillo with the correct LAPACK? – Alejandro Andrade Nov 25 '21 at 00:54
  • At that point you are building your own program and linking it with both the armadillo wrapper library and the lapack library that you have installed on your system. Even if you are using armadillo as a header-only library, due to the `ARMA_DONT_USE_WRAPPER` define, you can still link with the wrapper library if you want. This just brings along whatever it was linked with. You can make things clearer and just link directly with blas and lapack instead. – darcamo Nov 25 '21 at 12:51
  • Shameless plug: In case you use the gdb debugger you might be interested in [pretty printers](https://github.com/darcamo/gdb_armadillo_helpers) for the types in the armadillo library. – darcamo Nov 25 '21 at 12:57