
I am using scipy.sparse in my application and want to run some performance tests, for which I need to create a large sparse matrix. As long as the matrix is small, I can create it using the command

import scipy.sparse as sp
a = sp.rand(1000,1000,0.01)

which results in a 1000 by 1000 matrix with 10,000 nonzero entries (a reasonable density, meaning approximately 10 nonzero entries per row).
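
One can check the entry count directly (using the matrix a from above):

print(a.nnz)  # ~10,000, since the number of entries is N * N * density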

The problem arises when I try to create a larger matrix. For example, for a 100,000 by 100,000 matrix (I have dealt with far larger matrices before), I run

import scipy.sparse as sp
N = 100000
d = 0.0001
a = sp.rand(N, N, d)

which should result in a 100,000 by 100,000 matrix with one million nonzero entries (well within the realm of the possible). Instead, I get an error message:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    sp.rand(100000,100000,0.0000001)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 723, in rand
    j = random_state.randint(mn)
  File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10327)
OverflowError: Python int too large to convert to C long

This is an internal scipy error that I cannot work around.


I understand that I can create a 10*n by 10*n matrix by creating one hundred n by n matrices and stacking them together. However, I think that scipy.sparse should be able to handle the creation of large sparse matrices directly (I say again, 100k by 100k is by no means large, and scipy is more than comfortable handling matrices with several million rows). Am I missing something?
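
For reference, the stacking workaround I have in mind looks roughly like this (a sketch using scipy.sparse.bmat; block size and grid shape are illustrative):

import scipy.sparse as sp

n = 10000   # block size: a 10 x 10 grid of such blocks gives 100,000 by 100,000
d = 0.0001  # target density, as above
# each sp.rand call only has to index n*n = 10^8 entries, below 2^31 - 1
blocks = [[sp.rand(n, n, d) for _ in range(10)] for _ in range(10)]
a = sp.bmat(blocks)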

5xum
  • This is probably because it's picking the *random* entries to give your matrix by selecting a `32 bit int` between 0 and `N*M`, and the max 32-bit (signed) int is `2^31-1` (`100,000*100,000 = 10,000,000,000 > 2,147,483,647 = 2^31-1`). Building it in blocks using [`bmat`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.bmat.html#scipy.sparse.bmat) is probably the easiest workaround. Try making `N*M = 2^31-2` and then `2^31` and see if that causes the problem to pop up (see the sketch after these comments). – will Feb 24 '15 at 10:59
  • I can't edit that previous comment anymore, but that error is consistent with what I describe: `Python int too large to convert to C long` and the limits in the [climits](http://www.cplusplus.com/reference/climits/) header. – will Feb 24 '15 at 11:07
  • This probably occurs only on 32-bit Python, which is probably why the bug wasn't noticed earlier. – pv. Feb 24 '15 at 11:12
  • As Jan-Philip Gehrcke points out below, it is system-dependent - I think you should be able to have a look in `stdint.h` on your system though and see what your limits are. – will Feb 24 '15 at 11:22
  • @pv. No, I am running a 64-bit Python – 5xum Feb 24 '15 at 11:44
  • I opened an [issue for the wrong error message](https://github.com/scipy/scipy/issues/4557). – cgohlke Feb 25 '15 at 00:48
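
A quick way to test will's boundary suggestion (a sketch; on an affected build the second call fails, and newer SciPy versions may raise a ValueError instead of an OverflowError):

import scipy.sparse as sp

# probe the suspected 32-bit boundary: N*M just below and at 2^31
a = sp.rand(2**15, 2**16 - 1, 1e-6)  # N*M = 2^31 - 2^15, below the limit
print(a.shape)                       # fine on all builds
b = sp.rand(2**15, 2**16, 1e-6)      # N*M = 2^31 > 2^31 - 1, the C long max
# raises OverflowError on builds where the native long is 32 bits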

1 Answer


Without getting to the bottom of the issue, you should make sure that you are using a 64-bit build on a 64-bit architecture, on a Linux platform. There, the native `long` data type is 64 bits wide (as opposed to Windows, I believe).

For reference, see the common 64-bit data model tables: Linux uses LP64, where `long` is 64 bits, while Windows uses LLP64, where `long` remains 32 bits.

Edit: Maybe I was not explicit enough before -- on 64-bit Windows, the classic native `long` data type is 32 bits wide (also see this question). This might be the problem in your case; that is, your code might just work when you change platform to Linux. I cannot say this with absolute certainty, because it really depends on which native data types are used in the numpy/scipy C source. Of course there are 64-bit data types available on Windows, and usually a platform case analysis is performed with compiler directives, with proper types chosen via macros -- I cannot really imagine that they used 32-bit data types by accident.
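
One quick way to see which native `long` your own NumPy build uses (a sketch; on NumPy builds of this era, `np.int_` maps to the platform's C `long`):

import numpy as np

# np.int_ corresponds to the C long of the build
print(np.dtype(np.int_).itemsize)  # 8 on Linux x86-64 (LP64), 4 on 64-bit Windows (LLP64)
print(np.iinfo(np.int_).max)       # 2^63 - 1 or 2^31 - 1, the largest representable index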

Edit 2:

I can provide three data samples supporting my hypothesis.

Debian 64-bit, Python 2.7.3 and SciPy 0.10.1 binaries from the Debian repos:

Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy; print scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
0.10.1
(100000, 100000)

Windows 7 64-bit, 32-bit Python build, 32-bit SciPy 0.10.1 build, both from ActivePython:

ActivePython 2.7.5.6 (ActiveState Software Inc.) based on
Python 2.7.5 (default, Sep 16 2013, 23:16:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy; print scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
0.10.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user\AppData\Roaming\Python\Python27\site-packages\scipy\sparse\construct.py", line 426, in rand
    raise ValueError(msg % np.iinfo(tp).max)
ValueError: Trying to generate a random sparse matrix such as the product of dimensions is
greater than 2147483647 - this is not supported on this machine

Windows 7 64-bit, 64-bit ActivePython build, 64-bit SciPy 0.15.1 build (from Gohlke, built against MKL):

ActivePython 3.4.1.0 (ActiveState Software Inc.) based on
Python 3.4.1 (default, Aug  7 2014, 13:09:27) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy; scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
'0.15.1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 723, in rand
    j = random_state.randint(mn)
  File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10327)
OverflowError: Python int too large to convert to C long
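
If switching platforms is not an option, you can sidestep sp.rand entirely and assemble the matrix yourself with explicit 64-bit index arrays (a sketch, not part of scipy's API; the helper name is made up, and duplicate (row, col) pairs get summed on conversion, so the density is only approximate):

import numpy as np
import scipy.sparse as sp

def big_sprand(n, m, density):
    # hypothetical helper: random sparse matrix built from int64 index arrays
    nnz = int(n * m * density)
    rows = np.random.randint(0, n, size=nnz).astype(np.int64)
    cols = np.random.randint(0, m, size=nnz).astype(np.int64)
    vals = np.random.rand(nnz)
    # duplicate (row, col) pairs are summed when converting to CSR
    return sp.coo_matrix((vals, (rows, cols)), shape=(n, m)).tocsr()

a = big_sprand(100000, 100000, 0.0001)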
Dr. Jan-Philip Gehrcke
  • I am using a 64-bit build on 64-bit Python on a 64-bit Windows 7 platform. – 5xum Feb 24 '15 at 11:43
  • As I do not have a Linux platform to test your assumption on, I can only guess that you are correct. – 5xum Feb 24 '15 at 12:16
  • Also, there are no official 64-bit builds of numpy available for Windows -- what did you install, actually? Did you use http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy? – Dr. Jan-Philip Gehrcke Feb 24 '15 at 13:11
  • Yes, I used the unofficial binary. It worked well for me in the past. – 5xum Feb 24 '15 at 13:13
  • Gohlke's builds are created with Intel's compiler suite. It *could* be that this data type "confusion" is a weakness of these compilers. I am not sure which compilers other third-party Python distributions use, but maybe you want to try Enthought, ActiveState, or Anaconda Python. They all ship their own builds of NumPy, and it could be that one of them does not suffer from what you observe. – Dr. Jan-Philip Gehrcke Feb 24 '15 at 15:06