86

In my code I have loop in which I construct and over determined linear system and try to solve it:

#pragma omp parallel for
for (int i = 0; i < n[0]+1; i++) {
    for (int j = 0; j < n[1]+1; j++) {
        for (int k = 0; k < n[2]+1; k++) {
            arma::mat A(max_points, 2);
            arma::mat y(max_points, 1);
            // initialize A and y

            arma::vec solution = solve(A,y);
        }
    }
}

Sometimes, quite randomly the program hangs or the results in the solution vector are NaN. And if I put do this:

arma::vec solution;
#pragma omp critical 
{
    solution = solve(weights*A,weights*y);
}

then these problem don't seem to happen anymore.

When it hangs, it does so because some threads are waiting at the OpenMP barrier:

Thread 2 (Thread 0x7fe4325a5700 (LWP 39839)):
#0  0x00007fe44d3c2084 in gomp_team_barrier_wait_end () from /usr/lib64/gcc-4.9.2/lib64/gcc/x86_64-redhat-linux-gnu/4.9.2/libgomp.so.1
#1  0x00007fe44d3bf8c2 in gomp_thread_start () at ../.././libgomp/team.c:118
#2  0x0000003f64607851 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003f642e890d in clone () from /lib64/libc.so.6

And the other threads are stuck inside Armadillo:

Thread 1 (Thread 0x7fe44afe2e60 (LWP 39800)):
#0  0x0000003ee541f748 in dscal_ () from /usr/lib64/libblas.so.3
#1  0x00007fe44c0d3666 in dlarfp_ () from /usr/lib64/atlas/liblapack.so.3
#2  0x00007fe44c058736 in dgelq2_ () from /usr/lib64/atlas/liblapack.so.3
#3  0x00007fe44c058ad9 in dgelqf_ () from /usr/lib64/atlas/liblapack.so.3
#4  0x00007fe44c059a32 in dgels_ () from /usr/lib64/atlas/liblapack.so.3
#5  0x00007fe44f09fb3d in bool arma::auxlib::solve_ud<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> >(arma::Mat<double>&, arma::Mat<double>&, arma::Base<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> > const&) () at /usr/include/armadillo_bits/lapack_wrapper.hpp:677
#6  0x00007fe44f0a0f87 in arma::Col<double>::Col<arma::Glue<arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::glue_solve> >(arma::Base<double, arma::Glue<arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times>, arma::glue_solve> > const&) ()
at /usr/include/armadillo_bits/glue_solve_meat.hpp:39

As you can see from the stacktrace my version of Armadillo uses atlas. And according to this documentation atlas seems to be thread safe: ftp://lsec.cc.ac.cn/netlib/atlas/faq.html#tsafe

Update 9/11/2015

I finally got some time to run more tests, based on the suggestions of Vladimir F.

When I compile armadillo with ATLAS's BLAS, I'm still able to reproduce then hangs and the NaNs. When it hangs, the only thing that changes in the stacktrace is the call to BLAS:

#0  0x0000003fa8054718 in ATL_dscal_xp1yp0aXbX@plt () from /usr/lib64/atlas/libatlas.so.3
#1  0x0000003fb05e7666 in dlarfp_ () from /usr/lib64/atlas/liblapack.so.3
#2  0x0000003fb0576a61 in dgeqr2_ () from /usr/lib64/atlas/liblapack.so.3
#3  0x0000003fb0576e06 in dgeqrf_ () from /usr/lib64/atlas/liblapack.so.3
#4  0x0000003fb056d7d1 in dgels_ () from /usr/lib64/atlas/liblapack.so.3
#5  0x00007ff8f3de4c34 in void arma::lapack::gels<double>(char*, int*, int*, int*, double*, int*, double*, int*, double*, int*, int*) () at /usr/include/armadillo_bits/lapack_wrapper.hpp:677
#6  0x00007ff8f3de1787 in bool arma::auxlib::solve_od<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> >(arma::Mat<double>&, arma::Mat<double>&, arma::Base<double, arma::Glue<arma::Mat<double>, arma::Mat<double>, arma::glue_times> > const&) () at /usr/include/armadillo_bits/auxlib_meat.hpp:3434

Compiling without ATLAS, only with netlib BLAS and LAPACK, I was able to reproduce the NaNs but not the hangs.

In both cases, surrounding solve() with #pragma omp critical I have no problems at all

GAMITG
  • 3,810
  • 7
  • 32
  • 51
maxdebayser
  • 1,066
  • 7
  • 10
  • 1
    Is /usr/lib64/libblas.so.3 part of atlas? Why it isn't located in /usr/lib64/atlas? – Vladimir F Героям слава May 05 '15 at 21:34
  • 1
    No, in opensuse it's part of package liblas3 and in redhat it's part of package blas. – maxdebayser Jul 16 '15 at 16:54
  • 2
    Then you cannot use any guarantees of ATLAS when you use the default BLAS. – Vladimir F Героям слава Jul 16 '15 at 16:57
  • 1
    This is very strange indeed, Good Sir, I think I have to find out why ATLAS LAPACK is using an external BLAS first. Thanks for pointing that out. – maxdebayser Jul 16 '15 at 17:35
  • 2
    Have you solved this? If not, which packages are installed, and could you please post the command you used to compile the program? – vindvaki Aug 09 '15 at 21:16
  • 3
    You can also try using [OpenBLAS](http://xianyi.github.io/OpenBLAS/) instead of Atlas. – mtall Aug 26 '15 at 05:08
  • No, it can be a really risky procedure. – Joshua Vaughan Sep 24 '15 at 18:09
  • Why would it be risky? It's simply a matter of linking with OpenBLAS instead of Atlas. – mtall Oct 02 '15 at 14:59
  • so what about install debug-info and seeing exactly where they are stack with gdb ? command is : sudo debuginfo-install blas . – g24l Oct 21 '15 at 21:44
  • You don't quite provide a [MCVE](http://stackoverflow.com/help/mcve), but I tried to reproduce your problem and could not (mine runs with no problems). – Brent Bradburn Oct 22 '15 at 00:33
  • You can try to use firefox rr (http://rr-project.org/) in order to obtain a valid stack trace when things go wrong. I don't really know if it will work on your problem because they claim this: "emulates a single-core machine. So, parallel programs incur the slowdown of running on a single core. This is an inherent feature of the design.". It is worth a shot though. – Gustavo Muenz Oct 26 '15 at 21:54

2 Answers2

2

Are you sure your systems are over determined? solve_ud in your stack trace says otherwise. Though you have solve_od too, and probably that's nothing to do with the issue. But it doesn't hurt to find why that's happening and fix it if you think the systems should be od.

Is armadillo solve() thread safe?

That I think depends on your lapack version, also see this. Looking at the code of solve_od all the variables accessed seem to be local. Note the warning in the code:

NOTE: the dgels() function in the lapack library supplied by ATLAS 3.6 seems to have problems

Thus it seems only lapack::gels can cause trouble for you. If fixing lapack is not possible, a workaround is to stack your systems and solve a single large system. That probably would be even more efficient if your individual systems are small.

fireant
  • 14,080
  • 4
  • 39
  • 48
1

The thread safety of Armadillo's solve() function depends (only) on the BLAS library that you use. The LAPACK implementations are thread safe when BLAS is. The Armadillo solve() function is not thread safe when linking to the reference BLAS library. However, it is thread safe when using OpenBLAS. Additionally, ATLAS provides a BLAS implementation that also mentions it is thread safe, and the Intel MKL is thread safe as well, but I have no experience with Armadillo linked to those libraries.

Of course, this only applies when you run solve() from multiple threads with different data.

André Offringa
  • 362
  • 2
  • 7