17

What's the state of the art with regards to getting numpy to use mutliple cores (on Intel hardware) for things like inner and outer vector products, vector-matrix multiplications etc?

I am happy to rebuild numpy if necessary, but at this point I am looking at ways to speed things up without changing my code.

For reference, my show_config() is as follows, and I've never observed numpy to use more than one core:

atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']

blas_opt_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']

atlas_blas_threads_info:
    libraries = ['ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    language = c
    include_dirs = ['/usr/local/atlas-3.9.16/include']

lapack_opt_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/local/atlas-3.9.16/lib']
    define_macros = [('ATLAS_INFO', '"\\"3.9.16\\""')]
    language = f77
    include_dirs = ['/usr/local/atlas-3.9.16/include']

lapack_mkl_info:
  NOT AVAILABLE

blas_mkl_info:
  NOT AVAILABLE

mkl_info:
  NOT AVAILABLE
NPE
  • 486,780
  • 108
  • 951
  • 1,012
  • 4
    I doubt you can achive any speedup by the multithreaded computation fo dot products of vectors of size 4000. Such a dot product needs only a few microseconds to compute. The overhead of assigning the task to a separate thread will probably at least nullify any speed you might gain, even when using thread pools. – Sven Marnach May 13 '11 at 20:55
  • I'm multiplying 32M x (4k ... 1.5M) matrices with (4k ... 1.5M) x something matrices, and tried to do so using the multiprocessing-toolbox, nevertheless this seems to create a lot of memory overhead, as data is copied to new processes (thank the GIL for that). Would be great if all 8 cores were used by atlas. – Herbert Jun 30 '15 at 14:07

1 Answers1

7

You should probably start by checking whether the Atlas build that numpy is using has been built with multi-threading. You can build and run this to inspect the Atlas configuration (straight from the Atlas FAQ):

main()
/*
 * Compile, link and run with something like:
 *    gcc -o xprint_buildinfo -L[ATLAS lib dir] -latlas ; ./xprint_buildinfo
 * if link fails, you are using ATLAS version older than 3.3.6.
 */
{
   void ATL_buildinfo(void);
   ATL_buildinfo();
   exit(0);
}

If you have don't have a multithreaded version of Atlas: "there's your problem". If it is multithreaded, then you need to exercise one of the multithreaded BLAS3 routines (probably dgemm), with a suitably large matrix-matrix product and see whether threading is used. I think I am right in saying that neither BLAS 2 and BLAS 1 routines in Atlas support multithreading (and with good reason because there is no performance advantage except at truly enormous problem sizes).

talonmies
  • 70,661
  • 34
  • 192
  • 269