Python NUMPY HUGE Matrices multiplication

Question

I need to multiply two big matrices and sort their columns.

 import numpy
 a= numpy.random.rand(1000000, 100)
 b= numpy.random.rand(300000,100)
 c= numpy.dot(b,a.T)
 sorted = [argsort(j)[:10] for j in c.T]

This process takes a lot of time and memory. Is there a way to fasten this process? If not how can I calculate RAM needed to do this operation? I currently have an EC2 box with 4GB RAM and no swap.

I was wondering if this operation can be serialized and I dont have to store everything in the memory.

So all you are actually interested in are the 10 smallest scalar products? Then you can iterate through in blocks and throw a lot away on the way. — eickenberg, Jul 30 '14 at 21:56
That seems to be a nice idea.. how can I do so? I mean is there a pythonic way or do I have to write my own algorithm? — pg2455, Jul 30 '14 at 22:01
If you only need the 10 smallest you should have a look at [`np.argpartition`](http://docs.scipy.org/doc/numpy-1.8.1/reference/generated/numpy.argpartition.html), I think it will save you some time. For other suggestions: http://stackoverflow.com/q/6910641/2379410 — , Aug 01 '14 at 11:14
Have a look at dask https://www.dask.org/ It can automatically batch numpy like operations so they fit into memory. — C. Yduqoli, Mar 16 '23 at 02:20

Roland Smith · Accepted Answer · 2014-07-31T07:57:28.743

One thing that you can do to speed things up is compile numpy with an optimized BLAS library like e.g. ATLAS, GOTO blas or Intel's proprietary MKL.

To calculate the memory needed, you need to monitor Python's Resident Set Size ("RSS"). The following commands were run on a UNIX system (FreeBSD to be precise, on a 64-bit machine).

> ipython

In [1]: import numpy as np

In [2]: a = np.random.rand(1000, 1000)

In [3]: a.dtype
Out[3]: dtype('float64')

In [4]: del(a)

To get the RSS I ran:

ps -xao comm,rss | grep python

[Edit: See the ps manual page for a complete explanation of the options, but basically these ps options make it show only the command and resident set size of all processes. The equivalent format for Linux's ps would be ps -xao c,r, I believe.]

The results are;

After starting the interpreter: 24880 kiB
After importing numpy: 34364 kiB
After creating a: 42200 kiB
After deleting a: 34368 kiB

Calculating the size;

In [4]: (42200 - 34364) * 1024
Out[4]: 8024064

In [5]: 8024064/(1000*1000)
Out[5]: 8.024064

As you can see, the calculated size matches the 8 bytes for the default datatype float64 quite well. The difference is internal overhead.

The size of your original arrays in MiB will be approximately;

In [11]: 8*1000000*100/1024**2
Out[11]: 762.939453125

In [12]: 8*300000*100/1024**2
Out[12]: 228.8818359375

That's not too bad. However, the dot product will be way too large:

In [19]: 8*1000000*300000/1024**3
Out[19]: 2235.1741790771484

That's 2235 GiB!

What you can do is split up the problem and perfrom the dot operation in pieces;

load b as an ndarray
load every row from a as an ndarray in turn.
multiply the row by every column of b and write the result to a file.
del() the row and load the next row.

This wil not make it faster, but it would make it use less memory!

Edit: In this case I would suggest writing the output file in binary format (e.g. using struct or ndarray.tofile). That would make it much easier to read a column from the file with e.g. a numpy.memmap.

Hey thanks for the help. Can you explain what is *xao comm,rss* in your bash script.I got the full understanding of memory issues. Thanks a zillion. — pg2455, Jul 30 '14 at 22:37
It's also possible to get the resident set size directly in python by calling `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` https://docs.python.org/3/library/resource.html — vaer-k, Sep 19 '16 at 16:02

score 4 · Answer 2 · answered Jul 31 '14 at 02:42

What DrV and Roland Smith said are good answers; they should be listened to. My answer does nothing more than present an option to make your data sparse, a complete game-changer.

Sparsity can be extremely powerful. It would transform your O(100 * 300000 * 1000000) operation into an O(k) operation with k non-zero elements (sparsity only means that the matrix is largely zero). I know sparsity has been mentioned by DrV and disregarded as not applicable but I would guess it is.

All that needs to be done is to find a sparse representation for computing this transform (and interpreting the results is another ball game). Easy (and fast) methods include the Fourier transform or wavelet transform (both rely on similarity between matrix elements) but this problem is generalizable through several different algorithms.

Having experience with problems like this, this smells like a relatively common problem that is typically solved through some clever trick. When in a field like machine learning where these types of problems are classified as "simple," that's often the case.

That was an eye opener. Thanks a lot for sharing – pg2455 Jul 31 '14 at 04:31 — pg2455, Jul 31 '14 at 04:31

score 2 · Answer 3 · answered Jul 30 '14 at 21:41

YOu have a problem in any case. As Roland Smith shows you in his answer, the amount of data and number of calculations is enormous. You may not be very familiar with linear algebra, so a few words of explanation might help in understanding (and then hopefully solving) the problem.

Your arrays are both a collection of vectors with length 100. One of the arrays has 300 000 vectors, the other one 1 000 000 vectors. The dot product between these arrays means that you calculate the dot product of each possible pair of vectors. There are 300 000 000 000 such pairs, so the resulting matrix is either 1.2 TB or 2.4 TB depending on whether you use 32 or 64-bit floats.

On my computer dot multiplying a (300,100) array with a (100,1000) array takes approximately 1 ms. Extrapolating from that, you are looking at a 1000 s calculation time (depending on the number of cores).

The nice thing about taking a dot product is that you can do it piecewise. Keeping the output is then another problem.

If you were running it on your own computer, calculating the resulting matrix could be done in the following way:

create an output array as a np.memmap array onto the disk
calculate the results one row at a time (as explained by Roland Smith)

This would result in a linear file write with a largish (2.4 TB) file.

This does not require too many lines of code. However, make sure everything is transposed in a suitable way; transposing the input arrays is cheap, transposing the output is extremely expensive. Accessing the resulting huge array is cheap if you can access elements close to each other, expensive, if you access elements far away from each other.

Sorting a huge memmapped array has to be done carefully. You should use in-place sort algorithms which operate on contiguous chunks of data. The data is stored in 4 KiB chunks (512 or 1024 floats), and the fewer chunks you need to read, the better.

Now that you are not running the code in our own machine but on a cloud platform, things change a lot. Usually the cloud SSD storage is very fast with random accesses, but IO is expensive (also in terms of money). Probably the least expensive option is to calculate suitable chunks of data and send them to S3 storage for further use. The "suitable chunk" part depends on how you intend to use the data. If you need to process individual columns, then you send one or a few columns at a time to the cloud object storage.

However, a lot depends on your sorting needs. Your code looks as if you are finally only looking at a few first items of each column. If this is the case, then you should only calculate the first few items and not the full output matrix. That way you can do everything in memory.

Maybe if you tell a bit more about your sorting needs, there can be a viable way to do what you want.

Oh, one important thing: Are your matrices dense or sparse? (Sparse means they mostly contain 0's.) If your expect your output matrix to be mostly zero, that may change the game completely.

Hey, thanks for the help. Unfortunately my matrices are not sparse. I just read about np.memmap. It says it uses Python's Memory Map Object and does not allow files > 2GB. As you said it will take around 2TB for data is there a way to increase size of memmap. — pg2455, Jul 30 '14 at 22:20
Can you explain a bit about sorting on chunks of data? How is that possible in python? any specific library. As far as I understand my problem can be solved by multiplying one vector at a time with matrix and sorting it and then storing it in the file. — pg2455, Jul 30 '14 at 22:26
Modern versions of `memmap` do not have any practical limits with 64-bit Pythons, the 2 GB limit is an old one and may depend on the operating system. Also, your life is much easier if you only sort items within a row (transpose your matrix if you need), as then a simple `sort` sorts the row very efficiently. (I understood that you need to move columns or rows in a matrix, and that would have been more difficult.) — DrV, Jul 30 '14 at 22:49

Python NUMPY HUGE Matrices multiplication

3 Answers3

Linked