
I am working on a scientific cluster that was recently upgraded by the administrator, and now my code is very slow, whereas it used to run at a decent speed. I am using Python 3.4.

The way these things work here is the following: I have to guess what the administrator may have changed and then ask him to make the appropriate changes, because if I ask him a direct question we will not get anywhere.

So I have run my code with a profiler (a sketch of how the profile was collected follows the list below) and I have found that a few routines are called many times. These routines are:

  1. built-in method array (called ~10^5 times, execution time 0.003 s)
  2. sort of numpy.ndarray (~5000 calls, 0.03 s)
  3. uniform of mtrand.RandomState (~2000 calls, 0.03 s)
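For reference, this is roughly how the profile was collected and inspected (a sketch; my_script.py and profile.out are placeholders, not my actual file names):

    # Equivalent to running: python3 -m cProfile -o profile.out my_script.py
    import pstats

    stats = pstats.Stats("profile.out")
    # sort by cumulative time and show the ten most expensive entries
    stats.sort_stats("cumulative").print_stats(10)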

My guess is that some of these libraries were parallelized in the previously installed version of Python, for example by being linked to MPI-parallelized or multi-threaded math kernel libraries.

I would like to know whether my guess is correct or whether I have to think of something else, because my code itself has not changed.


The routines I have quoted here are the most relevant ones, because they account for 85% of the total time. In particular, array takes 55% of the total time. The performance of my code has degraded by a factor of 10. Before talking with the system manager I would like to get confirmation that these routines do have a parallel version.


Of course I cannot test my code on both the new and the old configuration of the cluster, because the old configuration is gone. But I can see that on this cluster numpy.array takes 8 minutes, while on another cluster that I have access to it takes 2 seconds. From top I can see that the memory used is always very low (~0.1%) while a single CPU is used at 100%.
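This is roughly the kind of timing check I ran on both clusters (a sketch; the array size is arbitrary):

    # Time repeated numpy.array() calls on a plain Python nested list.
    import timeit

    setup = ("import numpy; "
             "data = [[float(i + j) for j in range(100)] for i in range(1000)]")
    t = timeit.timeit("numpy.array(data)", setup=setup, number=100)
    print("100 calls to numpy.array:", t, "seconds")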


 In [3]: numpy.__config__.show()
 lapack_info:
     libraries = ['lapack']
     library_dirs = ['/usr/lib64']
     language = f77
 atlas_threads_info:
     libraries = ['satlas']
     library_dirs = ['/usr/lib64/atlas']
     define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
     language = c
     include_dirs = ['/usr/include']
 blas_opt_info:
     libraries = ['satlas']
     library_dirs = ['/usr/lib64/atlas']
     define_macros = [('ATLAS_INFO', '"\\"3.10.1\\""')]
     language = c
     include_dirs = ['/usr/include']
 atlas_blas_threads_info:
     libraries = ['satlas']
     library_dirs = ['/usr/lib64/atlas']
     define_macros = [('ATLAS_INFO', '"\\"3.10.1\\""')]
     language = c
     include_dirs = ['/usr/include']
 openblas_info:
   NOT AVAILABLE
 lapack_opt_info:
     libraries = ['satlas', 'lapack']
     library_dirs = ['/usr/lib64/atlas', '/usr/lib64']
     define_macros = [('ATLAS_WITHOUT_LAPACK', None)]
     language = f77
     include_dirs = ['/usr/include']
 lapack_mkl_info:
   NOT AVAILABLE
 blas_mkl_info:
   NOT AVAILABLE
 mkl_info:
   NOT AVAILABLE            

ldd /usr/lib64/python3.4/site-packages/numpy/core/_dotblas.cpython-34m.so
     linux-vdso.so.1 =>  (0x00007fff46172000)
     libsatlas.so.3 => /usr/lib64/atlas/libsatlas.so.3 (0x00007f0d941a0000)
     libpython3.4m.so.1.0 => /lib64/libpython3.4m.so.1.0 (0x00007f0d93d08000)
     libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f0d93ae8000)
     libc.so.6 => /lib64/libc.so.6 (0x00007f0d93728000)
     libgfortran.so.3 => /lib64/libgfortran.so.3 (0x00007f0d93400000)
     libm.so.6 => /lib64/libm.so.6 (0x00007f0d930f8000)
     libdl.so.2 => /lib64/libdl.so.2 (0x00007f0d92ef0000)
     libutil.so.1 => /lib64/libutil.so.1 (0x00007f0d92ce8000)
     /lib64/ld-linux-x86-64.so.2 (0x00007f0d950e0000)
     libquadmath.so.0 => /lib64/libquadmath.so.0 (0x00007f0d92aa8000)
     libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f0d92890000)

Numpy is already linked to ATLAS, and I see a link to libpthread.so (so I assume it is already multithreaded, is it?).
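To double-check at run time which BLAS/LAPACK shared objects the interpreter actually loads (rather than what ldd reports for a single extension module), the process map can be inspected on Linux; a minimal sketch:

    # List BLAS-related shared objects mapped into the running process (Linux only).
    import numpy  # importing numpy loads its BLAS/LAPACK backends

    with open("/proc/self/maps") as maps:
        libs = {line.split()[-1] for line in maps
                if "blas" in line or "atlas" in line or "mkl" in line}
    for lib in sorted(libs):
        print(lib)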

On the other hand, I updated my copy of numpy from 1.8.2 to 1.9.2 and now the array method only takes 5 s instead of 300 s. I think this is probably the reason for my code slowing down (maybe the system administrator downgraded the numpy version? Who knows!).
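A quick way to confirm which numpy the interpreter on the cluster actually picks up (for instance, to rule out an old system-wide install shadowing the upgraded one):

    import numpy

    print(numpy.__version__)  # 1.8.2 before my manual upgrade, 1.9.2 after
    print(numpy.__file__)     # shows whether a system or user-local install is imported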

simona
  • The most probable issue could be that before the update `numpy` was linked to an optimized/multithreaded BLAS library for vector and matrix operations (for instance OpenBLAS, ATLAS, MKL, etc.), while it uses a slower reference implementation after the update. – rth Apr 29 '15 at 00:19
  • Could you rather show the total time in your profiling? It doesn't matter so much that a routine was called 10^5 times if the cumulative time spent there is low. Just to clarify, your Python code by itself is not parallel, and rather relies on a multithreaded implementation of BLAS, right? – rth Apr 29 '15 at 00:22
  • @rth I have edited my question with answers to your comments – simona Apr 29 '15 at 10:43
  • Ummm... what was the previous version? I guess it was 2.x, in which case that version of python most probably didn't get removed. Instead of tracking what changed in python (although changes in python itself shouldn't make code slower going from 2 to 3), and which libraries were linked to parallel math kernels and now are not, it might be easier just to use the python 2 interpreter. My bet is `python` is now linked to `python3`, but `python2` should still be available. – luk32 Apr 29 '15 at 10:58
  • yes, I was using python2.7 on this cluster, but the problem is that the whole cluster was upgraded and all the libraries were reinstalled. I was running this same code in python3 on another cluster, and there it worked very well. The thing is that I have more compute time on the current cluster. – simona Apr 29 '15 at 11:01
  • Then I am with @rth. One needs to properly install parallelized math libraries in order for numpy to use it. As suggested by the [installation guide for numpy](http://docs.scipy.org/doc/numpy/user/install.html). If the admin is not very experienced and the installation is fresh they might have just gone with default "install python python-numpy" approach which might be insufficient. – luk32 Apr 29 '15 at 11:21
  • ok, this is plausible and it was also my first guess. is there a way to check against which math libraries numpy is compiled? something like `ldd numpy_something` ? – simona Apr 29 '15 at 11:32
  • @simona: [This question](http://stackoverflow.com/questions/21671040/link-atlas-mkl-to-an-installed-numpy) and its answer give good information about checking the configuration. – Jonathan Dursi Apr 29 '15 at 12:52

1 Answer


A parallelized BLAS only helps with a limited number of numpy/scipy functions (see these test scripts), for example the ones below; a rough timing check is sketched after the list:

  • numpy.dot
  • scipy.linalg.cholesky
  • scipy.linalg.svd
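As a rough check of whether a threaded BLAS is actually being used (a sketch, with arbitrary matrix sizes): run a large matrix product and watch top in another terminal. With a threaded ATLAS/OpenBLAS/MKL several cores should be busy; with a serial reference BLAS only one core will be, and the product will take noticeably longer.

    import time
    import numpy

    # A 2000x2000 double-precision product is large enough to keep BLAS
    # busy for a noticeable fraction of a second.
    a = numpy.random.rand(2000, 2000)
    b = numpy.random.rand(2000, 2000)

    start = time.time()
    numpy.dot(a, b)
    print("numpy.dot on 2000x2000 matrices:", time.time() - start, "seconds")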

If you can run

    import numpy.core._dotblas

without getting an ImportError, you have an optimized numpy.dot available.
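A minimal sketch of that check (note that numpy.core._dotblas only exists in older numpy releases; it was removed around numpy 1.10, so its absence on a newer numpy does not by itself mean dot is unoptimized):

    try:
        # present only when numpy.dot is BLAS-accelerated (numpy < 1.10)
        import numpy.core._dotblas
        print("optimized numpy.dot available")
    except ImportError:
        print("no _dotblas extension found")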

Array creation speed should not be influenced by this, however.

Can you post your code and how you use it, or at least a minimal example that shows the problem? How is your code run on the cluster?

Roland Smith