
I have to multiply very large 2D arrays in Python around 100 times. Each matrix consists of 32000x32000 elements.

I'm using np.dot(X, Y), but each multiplication takes a very long time. Below is an example of my code:

import numpy as np

X = None
for i in range(100):
    if X is None:
        # first iteration: nothing to multiply yet
        X = generate_large_2darray()
    else:
        Y = generate_large_2darray()
        X = np.dot(X, Y)

Is there any other method much faster?

Update

Here is a screenshot showing the htop interface: my Python script is using only one core. Also, after 3h25m only 4 multiplications have been done.


Update 2

I've tried to execute:

import numpy.distutils.system_info as info
info.get_info('atlas')

but I've received:

/home/francescof/.local/lib/python2.7/site-packages/numpy/distutils/system_info.py:564: UserWarning: Specified path /home/apy/atlas/lib is invalid.
  warnings.warn('Specified path %s is invalid.' % d)
{}

So, I think it's not well-configured.

Conversely, regarding blas I just receive {}, with no warnings or errors.
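A related check (a sketch; this uses numpy's own build-time configuration report rather than numpy.distutils, and its output format varies between numpy versions):

```python
import numpy as np

# Print the BLAS/LAPACK configuration numpy was built against; an empty or
# "NOT AVAILABLE" blas section suggests numpy fell back to its slow internal routines.
np.__config__.show()
```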

  • Is there any way you can split the array into multiple sub-arrays and run on multiple threads or processes? – RevanProdigalKnight Jun 30 '14 at 13:04
  • What do you mean with "very long time"? – tamasgal Jun 30 '14 at 13:08
  • Well, yes... Actually I have 8 cores, so your idea could be very good. I could split each of my matrices into three parts. Do you have any fast method to do that? – redcrow Jun 30 '14 at 13:10
  • @septi: at least one hour for each multiplication. – redcrow Jun 30 '14 at 13:11
  • How much RAM do you have? A 32000x32000 array of 8 byte floats requires almost 8GB of RAM. You will need at least three such arrays (X, Y, and result) plus any temporary space dot may need. My guess is almost all the time during the calculation is taken up by swapping. – Craig J Copi Jun 30 '14 at 14:06
  • Can you test how long the multiplication takes with 8000 x 8000 matrices? If there is already an improved matrix multiplication (Strassen multiplication, O(n^log2(7))) behind the call, then it should run for ~70 to 80 seconds if your numbers on the 32000x32000 multiplication were correct (~3600 seconds). – Vroomfondel Jun 30 '14 at 14:07
  • If your arrays are too big to fit in memory you can use `np.memmap` to store them on disk. You can call `np.dot()` directly on memmapped arrays, but unfortunately it's rather slow due to suboptimal caching behaviour. You might try [this workaround](http://stackoverflow.com/a/21096605/1461210) to speed things up. – ali_m Jun 30 '14 at 14:11
  • @Craig and ali_m: I have enough RAM (16 GB) to compute my multiplication, so no swapping is done. I'm sure because I use htop to monitor all processes and the RAM taken by my script is around 12GB (I'm using numpy.float32 types). – redcrow Jun 30 '14 at 14:13
  • What BLAS library are you using? – ali_m Jun 30 '14 at 14:15
  • Where does the data come from? Is it possible that you should be using sparse matrices for this? – YXD Jun 30 '14 at 14:46
  • As per @ali_m's question, this all comes down to what BLAS you are using. With threaded Intel MKL on a Haswell i7 this takes about 5 minutes. Your mileage may vary, but this should be a reasonable lower bound to shoot for. – Daniel Jun 30 '14 at 15:29
  • If numpy is correctly built against a threaded blas library (e.g. gotoblas or mkl) you should see your multiplication using 8 threads on your machine. There is no need to split it further. – Bort Jun 30 '14 at 15:29
  • @MrE: From the CAIDA dataset, then, from that, I build a stochastic matrix, so I cannot work using sparse matrices. – redcrow Jun 30 '14 at 16:03
  • @Bort: Actually, my script uses just one core... How is it possible? – redcrow Jun 30 '14 at 16:04
  • @all: I've updated my post including a screenshot of htop. – redcrow Jun 30 '14 at 16:11
  • Version 1.2-2build1 of what? If you're using the standard CBLAS, that's slow and single-threaded, and you'll get a big improvement in performance by switching to a multi-threaded BLAS such as OpenBLAS or MKL. – ali_m Jun 30 '14 at 17:24
  • @ali_m: you're right, sorry. I executed: `dpkg -s libblas3gf`, so the answer is BLAS. I'll try to switch to OpenBLAS and let you know, thank you. – redcrow Jun 30 '14 at 17:59
  • OK, good luck - you might find [this guide](http://stackoverflow.com/a/14391693/1461210) helpful. – ali_m Jun 30 '14 at 18:00
  • Ok, got it... When I installed numpy I used `pypm` by ActivePython and not `pip`. But now I've discovered that `pypm` uses some default configurations, so, although I already had ATLAS on my Linux system, `pypm` didn't configure numpy according to my system; for that reason the `site.cfg` file had some strange paths to ATLAS. So, I removed numpy using `pypm` and then re-installed it using `pip`. I've executed some tests and the performance is much better now. Thank you. – redcrow Jul 01 '14 at 09:40

2 Answers


As suggested by ali_m, using an optimized BLAS library can speed up the operations. However, the problem in my system was a bad configuration of numpy. Here is the solution:

1) make sure you have all required libraries (you can use ATLAS, OpenBLAS, etc.). I've chosen ATLAS in my case since it is directly supported in Ubuntu.

sudo apt-get install libatlas3gf-base libatlas-base-dev libatlas-dev

2) remove any previous numpy installations, e.g., pypm uninstall numpy (if you installed it using ActivePython)

3) reinstall numpy using pip: pip install numpy

4) make sure your atlas is correctly linked:

import numpy.distutils.system_info as info
info.get_info('atlas')

ATLAS version 3.8.4 built by buildd on Sat Sep 10 23:12:12 UTC 2011:
   UNAME    : Linux crested 2.6.24-29-server #1 SMP Wed Aug 10 15:58:57 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
   INSTFLG  : -1 0 -a 1
   ARCHDEFS : -DATL_OS_Linux -DATL_ARCH_HAMMER -DATL_CPUMHZ=1993 -DATL_USE64BITS -DATL_GAS_x8664
   F2CDEFS  : -DAdd_ -DF77_INTEGER=int -DStringSunStyle
   CACHEEDGE: 393216
   F77      : gfortran, version GNU Fortran (Ubuntu/Linaro 4.6.1-9ubuntu2) 4.6.1
   F77FLAGS : -fomit-frame-pointer -mfpmath=387 -O2 -falign-loops=4 -Wa,--noexecstack -fPIC -m64
   SMC      : gcc, version gcc (Ubuntu/Linaro 4.6.1-9ubuntu2) 4.6.1
   SMCFLAGS : -fomit-frame-pointer -mfpmath=387 -O2 -falign-loops=4 -Wa,--noexecstack -fPIC -m64
   SKC      : gcc, version gcc (Ubuntu/Linaro 4.6.1-9ubuntu2) 4.6.1
   SKCFLAGS : -fomit-frame-pointer -mfpmath=387 -O2 -falign-loops=4 -Wa,--noexecstack -fPIC -m64
{'libraries': ['lapack', 'f77blas', 'cblas', 'atlas'], 'library_dirs': ['/usr/lib/atlas-base/atlas', '/usr/lib/atlas-base'], 'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')], 'language': 'f77', 'include_dirs': ['/usr/include/atlas']}
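To confirm the relink actually helps, a quick benchmark on a smaller matrix is useful (a sketch; the size 2000 is an arbitrary choice small enough to finish in seconds, and np.dot cost grows roughly as n**3, so timings can be extrapolated):

```python
import time
import numpy as np

n = 2000  # small test size; scale the timing by (32000/n)**3 to estimate the full run
X = np.random.rand(n, n).astype(np.float32)
Y = np.random.rand(n, n).astype(np.float32)

t0 = time.time()
Z = np.dot(X, Y)
print("%dx%d dot took %.2f s" % (n, n, time.time() - t0))
```

With a properly linked threaded BLAS, all cores should light up in htop during the call.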
  • You might be interested to know that OpenBLAS is also provided as a package in the Ubuntu repositories (`libopenblas-base` / `libopenblas-dev`). In my experience OpenBLAS has been quite a lot faster than ATLAS, but this may depend on the model of your CPU. – ali_m Jul 01 '14 at 10:59
  • Yes, I know, thank you again. Indeed they are in repositories, but since I already had ATLAS, synaptic warned me that installing those libraries can break the other ones. I even installed a standalone OpenBLAS in `/opt/OpenBLAS`, but I think I cannot exploit it if numpy is installed by `pip`. It should be linkable to a different library only if numpy is installed from the source code (e.g., the github repository), right? – redcrow Jul 01 '14 at 12:00
  • If you'd prefer to use `pip` rather than building `numpy` from the git repository, you can still make it respect a particular configuration by creating a `.numpy-site.cfg` file in your home directory ([see here](http://stackoverflow.com/a/13796808/1461210)). – ali_m Jul 01 '14 at 12:51
  • Hey, just an update. I had to switch to OpenBLAS as suggested by ali_m because ATLAS didn't exploit the multithreading. Now my script exploits all 12 cores and the operations are much, much faster. – redcrow Jul 10 '14 at 20:08

Matrix multiplication is always expensive, roughly O(n³) for the standard algorithm. Performing this operation in Numpy is probably the fastest way to deal with it, short of writing your own matrix multiplier in a compiled language that is "closer to the metal" (like C)... and that would probably still be slower. I think you are doing this operation in the best way, but you must realize that a 32000x32000 matrix is very large to be performing any operations on, let alone matrix multiplication.

That was the bad news, but here is the good news. I don't know what type of data you are working with, but there can be, and often are, symmetries in the matrices in question which can greatly simplify the calculation. If your data is not entirely random there may be hope, but you will have to look into the actual structure of the matrices you are working with. I suggest reading about some of the "special matrices" to see if your data might fall into one of those categories. Any information you find on the category your data falls into should also discuss or cite efficient algorithms for handling these expensive operations.
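For instance, if the matrices happened to be sparse (common for graph-derived data, though not the case for the OP's dense stochastic matrix), `scipy.sparse` avoids touching zero entries entirely. A sketch, using a separate library (scipy) and randomly generated matrices purely for illustration:

```python
import numpy as np
import scipy.sparse as sp

n = 2000
# Hypothetical example: random sparse matrices at 1% density, stored in CSR format
A = sp.random(n, n, density=0.01, format='csr', dtype=np.float64)
B = sp.random(n, n, density=0.01, format='csr', dtype=np.float64)

# Sparse-aware product: cost scales with the number of nonzeros, not n**2 entries
C = A.dot(B)
print("result stores", C.nnz, "entries out of", n * n)
```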

  • Thank you for your contribution. Please read all other comments to my post. There are answers to your questions. – redcrow Jun 30 '14 at 16:46
  • I'm sorry. I'm sure I am being shortsighted but it is still not clear to me what the source of the data in the matrices is... If you spelled it out for me perhaps I could be more helpful. – user2645976 Jun 30 '14 at 16:54
  • This is the dataset: https://snap.stanford.edu/data/as-caida.html. I build a DiGraph (using networkx), then I convert it to a right-stochastic graph (http://goo.gl/dAZT8G). At the end, I transform that graph in a fully-stochastic numpy 2D-array to execute my multiplication. – redcrow Jun 30 '14 at 17:17