
I have a 50,000 x 5,000 matrix (float) file. When I use `x = np.genfromtxt(readFrom, dtype=float)` to load the file into memory, I get the following error message:

File "C:\Python27\lib\site-packages\numpy\lib\npyio.py", line 1583, in genfromtxt for (i, converter) in enumerate(converters)])
MemoryError

I want to load the whole file into memory because I am calculating the Euclidean distance between pairs of vectors using SciPy: `dis = scipy.spatial.distance.euclidean(x[row1], x[row2])`

Is there an efficient way to load a huge matrix file into memory?

Thank you.

Update:

I managed to solve the problem. Here is my solution. I am not sure whether it's efficient or logically correct, but it works fine for me:

# read every line, then convert each whitespace-separated row into a float32 array
x = open(readFrom, 'r').readlines()
y = np.asarray([np.array(s.split()).astype('float32') for s in x], dtype=np.float32)
....
dis = scipy.spatial.distance.euclidean(y[row1], y[row2])

Please help me to improve my solution.

Maggie
  • Calculating the distance for all pairs of vectors will take much longer than loading the file. Recheck if you really need all vector pairs. Also, you are going to need at least 25 * 10^7 * 4 = 10^9 bytes, perhaps 2*10^9 bytes -- the latter would be infeasible on a 32-bit system. – krlmlr Jul 14 '12 at 16:28
  • have a look at http://stackoverflow.com/q/1896674/1301710 – bmu Jul 14 '12 at 17:49

2 Answers


You're actually using 8-byte floats, since Python's float corresponds to C's double (at least on most systems):

a=np.arange(10,dtype=float)
print(a.dtype)  #np.float64

You should specify your data type as np.float32. Depending on your OS, whether it is 32-bit or 64-bit, and whether you're running 32-bit or 64-bit Python, the address space available to numpy could be smaller than your 4 GB, which could be an issue here as well.
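
For instance (a minimal sketch, reusing the `readFrom` filename from the question), the smaller dtype can be passed straight to `genfromtxt`:

import numpy as np

# Load directly as 32-bit floats instead of the default 64-bit doubles,
# halving the memory needed for the array itself.
x = np.genfromtxt(readFrom, dtype=np.float32)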

mgilson
  • Even if I use `dtype=np.float32`, I still get the memory error. – Maggie Jul 14 '12 at 16:34
  • @Mahin What happens if you just do: `>>> a=np.zeros((50000,5000),dtype=np.float32); a=1` instead of your `np.genfromtxt`? – mgilson Jul 14 '12 at 16:45
  • @Mahin -- My numpy is too old to support `genfromtxt`, but looking at the source of `loadtxt` (which is supposedly equivalent if you don't have missing values), numpy reads the values into a list (which is at least `4*N*sizeof(pointer)*N` bytes long). Then (I think) the data gets copied again when the numpy array is constructed, since ndarrays are contiguous in memory. I would suggest you iterate over the file yourself and pack the values after allocating the memory with np.zeros; it should be relatively easy since you know the size of the array. – mgilson Jul 14 '12 at 17:06
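
A sketch of the approach suggested in the comment above (untested here): preallocate the full float32 array with `np.zeros`, then fill it one row at a time while streaming the file, so no intermediate list of Python objects is ever built. The `readFrom` filename and the 50,000 x 5,000 shape are taken from the question.

import numpy as np

rows, cols = 50000, 5000                        # known shape of the matrix
y = np.zeros((rows, cols), dtype=np.float32)    # allocate the final array up front

with open(readFrom) as f:
    for i, line in enumerate(f):
        # parse one whitespace-separated row straight into the preallocated array
        y[i] = np.array(line.split(), dtype=np.float32)

Peak memory is then essentially just the final array, rather than the array plus a temporary list holding every value.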

Depending on your OS and Python version, it's quite likely that you'll never be able to allocate a 1GB array (mgilson's answer is spot on here). The problem is not that you're running out of memory, but that you're running out of contiguous memory. If you're on a 32-bit machine (especially running Windows), it will not help to add more memory. Moving to a 64-bit architecture would probably help.

Using smaller data types can certainly help; depending on the operations you use, a 16-bit float or even an 8-bit int might be sufficient.
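
As a rough back-of-the-envelope check (a sketch assuming the 50,000 x 5,000 shape from the question):

import numpy as np

# Approximate memory footprint of the full matrix for a few candidate dtypes
for dt in (np.float64, np.float32, np.float16, np.int8):
    size_gb = 50000 * 5000 * np.dtype(dt).itemsize / 1e9
    print("%s: %.2f GB" % (np.dtype(dt).name, size_gb))    # 2.00, 1.00, 0.50, 0.25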

If none of this works, then you're forced to admit that the data just doesn't fit in memory. You'll have to process it piecewise (in this case, storing the data as an HDF5 array might be very useful).
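
For example (a hypothetical sketch using h5py, which the answer does not name; the `matrix.h5` filename and the row indices are placeholders), the matrix can be written to disk once and individual rows read back on demand:

import numpy as np
import h5py
from scipy.spatial.distance import euclidean

# Write the text file into an HDF5 dataset row by row, so the whole matrix
# never has to sit in memory at once.
with h5py.File('matrix.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(50000, 5000), dtype='float32')
    with open(readFrom) as src:                      # readFrom: filename from the question
        for i, line in enumerate(src):
            dset[i] = np.array(line.split(), dtype=np.float32)

# Later, pull out only the two rows being compared.
with h5py.File('matrix.h5', 'r') as f:
    dset = f['data']
    dis = euclidean(dset[row1], dset[row2])          # row1, row2 as in the question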

Luke