2

When I use the following code to load a CSV file with NumPy

import numpy as np

# F initially holds the path to the CSV file; loadtxt replaces it with the array
F = np.loadtxt(F,skiprows=1, delimiter=',',usecols=(2,4,6))
MASS = F[:,1]         # usecols=(2,4,6) returns three columns, indexed 0, 1 and 2
#print(MASS)
Z = F[:,2]
N = len(MASS)
print(len(MASS))

I get the following error

Traceback (most recent call last):
  File "C:\Users\Codes\test2.py", line 16, in <module>
    F = np.loadtxt(F,skiprows=1, delimiter=',',usecols=(2,4,6))
  File "C:\Python34\lib\site-packages\numpy\lib\npyio.py", line 859, in loadtxt
    X.append(items)
MemoryError

I have 24 GB of physical memory and the file is 2.70 GB, so I do not understand why I am getting this error. Thanks!

EDIT

I also tried to load the same file like this

from itertools import islice

M, R, TID = [], [], []           # lists that will hold the three columns

f = open(F)                      # opens the file (F is the file path)
f.readline()                     # strips the header
nlines = islice(f, N)            # slices the file to read only the first N lines

for line in nlines:
    if line != '':
        line = line.strip()
        line = line.replace(',', ' ')  # replaces commas with spaces
        columns = line.split()
        tid = columns[2]
        m = columns[4]
        r = columns[6]                 # assigns a variable to each column
        M.append(m)
        R.append(r)                    # appends the data to the lists
        TID.append(tid)

print(len(M))

and got another memory error.

Traceback (most recent call last):
  File "C:\Users\Loop test.py", line 58, in <module>
    M.append(m)
MemoryError

It seems like in this case it is running out of memory while building the first list, `M`.

Stripers247
  • allocating a 2.70 GB array is problematic. It needs to be 2.70 GB of **contiguous** memory. Any machine will struggle with that. I would recommend finding a way to process this data in chunks. Or look into the memmap functionality of numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html – Marijn van Vliet Feb 13 '15 at 13:59
  • You only need 2.7 GB of contiguous *virtual* memory, and there shouldn't be any problem providing this, since the virtual address space is insanely big on modern machines. It's more likely that `loadtxt()` will copy the data, maybe multiple times, and internally allocate more data than the original file would occupy. – Sven Marnach Feb 13 '15 at 14:07
  • @SvenMarnach is there any way of checking that `loadtxt()` is making extra copies? – Stripers247 Feb 13 '15 at 14:11
  • @SvenMarnach is probably right here. – Marijn van Vliet Feb 13 '15 at 14:15
  • what happens when you allocate memory like: X = np.zeros(1e9)? (where 1e9 should be the size of your dataset) – Marijn van Vliet Feb 13 '15 at 14:18
  • @Rodin I get a memory error `Traceback (most recent call last): File "", line 1, in numpy.zeros(1e9) MemoryError` – Stripers247 Feb 13 '15 at 14:20
  • TIL: Python's memory management probably sucks – Marijn van Vliet Feb 13 '15 at 14:21
  • (Do you have a MATLAB install you can compare with? he asked in a hushed voice) – Marijn van Vliet Feb 13 '15 at 14:23
  • Yes I do hold on I will check – Stripers247 Feb 13 '15 at 14:24
  • I'm on a linux machine here with 16 GB of memory. X = np.zeros(1e9) returns in a fraction of a second, perfectly content. X.nbytes = 8000000000 – Marijn van Vliet Feb 13 '15 at 14:26
  • Matlab gives `EDU>> x=zeros(1e+9) Error using zeros Maximum variable size allowed by the program is exceeded.` – Stripers247 Feb 13 '15 at 14:29
  • I have a feeling windows is limiting the memory Python or matlab can use. I will look into this after class. – Stripers247 Feb 13 '15 at 14:30
  • try `x=zeros(1, 1e9)`, otherwise it attempts to generate a (1e9 x 1e9) matrix – Marijn van Vliet Feb 13 '15 at 14:30
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/70856/discussion-between-surfcast23-and-rodin). – Stripers247 Feb 13 '15 at 14:34
  • `loadtxt` is somewhat inefficient. `pandas.read_csv` is much more efficient than `loadtxt`, but you can also "roll your own" loadtxt-alike easily that will be much more memory-friendly. For what it's worth, have a look at http://stackoverflow.com/questions/8956832/python-out-of-memory-on-large-csv-file-numpy/8964779#8964779 – Joe Kington Feb 13 '15 at 14:34
  • @Rodin `EDU>> x=zeros(1,1e9) Maximum variable size allowed by the program is exceeded.` – Stripers247 Feb 13 '15 at 14:41
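
For reference, a minimal sketch of the memmap/chunking idea suggested in the comments above. It assumes the wanted columns are numeric and first streams the CSV once into a flat binary file (the file names here are placeholders), after which the data can be used without loading it all into RAM:

import numpy as np

# One-off conversion: stream the CSV into a flat binary file of float64 values,
# one line at a time, so memory use stays small.
with open('data.csv') as src, open('data.bin', 'wb') as dst:
    next(src)                                   # skip the header
    for line in src:
        cols = line.split(',')
        np.array([cols[2], cols[4], cols[6]], dtype=np.float64).tofile(dst)

# memmap maps the binary file into virtual memory; pages are only read on access.
data = np.memmap('data.bin', dtype=np.float64, mode='r').reshape(-1, 3)
MASS = data[:, 1]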

2 Answers

6

First off, I'd check that you're actually using a 64-bit build of Python. On Windows, it's common to wind up with a 32-bit build, even on 64-bit systems.

Try:

import platform
print(platform.architecture()[0])

If you see `32bit`, that's your problem. A 32-bit executable can only address 2 GB of memory, so you can never have an array (or other object) over 2 GB.
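
For instance, a minimal sketch of an equivalent check using only the standard library (`sys.maxsize` is capped at 2**31 - 1 on a 32-bit build):

import sys

# True on a 64-bit build, False on a 32-bit build
print(sys.maxsize > 2**32)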


However, `loadtxt` is rather inefficient because it works by building up a list and then converting it to a numpy array. Your example code does the same thing. (`pandas.read_csv` is much more efficient and very heavily optimized, if you happen to have pandas around.)

A list is a much less memory-efficient structure than a numpy array. It's essentially an array of pointers, so each item carries an extra 64-bit pointer on top of the Python object it points to.
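
A rough way to see that overhead, as a sketch (the exact numbers vary between builds):

import sys
import numpy as np

values = [float(i) for i in range(10**6)]   # one million Python floats in a list
arr = np.arange(10**6, dtype=np.float64)    # the same values as a numpy array

# The list stores a pointer per item *plus* a separate ~24-byte float object,
# while the array stores just the raw 8-byte values.
print(sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values))
print(arr.nbytes)                           # 8000000 bytes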

You can improve on this by using `numpy.fromiter` if you need "leaner" text I/O. See Python out of memory on large CSV file (numpy) for a more complete discussion (shameless plug).
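
As a sketch of that approach for the three columns in the question (assuming they're numeric; the file name is a placeholder), the generator below feeds `numpy.fromiter` one value at a time, so no intermediate list is built:

import numpy as np

def iter_columns(filename):
    """Yield the wanted fields one value at a time instead of building a list."""
    with open(filename) as f:
        next(f)                          # skip the header row
        for line in f:
            cols = line.split(',')
            yield float(cols[2])
            yield float(cols[4])
            yield float(cols[6])

# fromiter fills a flat float64 array straight from the generator;
# reshaping it to three columns afterwards is cheap.
data = np.fromiter(iter_columns('data.csv'), dtype=np.float64).reshape(-1, 3)
MASS = data[:, 1]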


Nonetheless, I don't think your problem is `loadtxt`. I think it's a 32-bit build of Python.

Joe Kington
  • You are correct, it says that I am running the `32bit` version of Python 3.4 but the `64bit` version of Python 3.3.2. Should I just uninstall 3.4 and install the `64bit` version? – Stripers247 Feb 13 '15 at 14:56
  • @Surfcast23 - That's entirely up to you. There's nothing wrong with using `3.3` if that's what's installed as 64-bit and working. – Joe Kington Feb 13 '15 at 20:58
1

The problem, I believe, is the requirement for a contiguous block of memory to hold the 2.7 GB of data. It will most likely take up more than 2.7 GB in memory as well, because of the overhead of the data structures and language internals. It is better to process the same file in chunks, or to use an HDF5-style data structure: http://www.h5py.org/
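
As a rough sketch of the chunked approach (assuming pandas is available and that columns 2, 4 and 6 are numeric; the file name is a placeholder):

import numpy as np
import pandas as pd

chunks = []
# chunksize makes read_csv return an iterator of DataFrames,
# so only one million rows are parsed at a time.
for chunk in pd.read_csv('data.csv', usecols=[2, 4, 6], chunksize=10**6):
    chunks.append(chunk.values.astype(np.float64))

data = np.vstack(chunks)   # final 3-column array
MASS = data[:, 1]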

erogol