
This is my code to generate files by home id. Then I will analyze each home separately.

    import numpy as np
    import pandas as pd

    data = pd.read_csv("110homes.csv")

    # write one CSV per home, keyed by dataid
    for i in np.unique(data['dataid']):
        print i
        d1 = data[data['dataid'] == i]  # rows belonging to this home
        d1.to_csv(str(i) + ".csv")

However, I am getting the error below. The machine has 200 GB of RAM, yet it still raises a MemoryError:

    data = pd.read_csv("110homes.csv")
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7544)
  File "pandas/parser.pyx", line 819, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8137)
  File "pandas/parser.pyx", line 1833, in pandas.parser._concatenate_chunks (pandas/parser.c:22383)
MemoryError
dsl1990
  • Nothing to do with your MemoryError, but have a look at the `df.groupby` method; it would make the code after the read more elegant (see the sketch after these comments). – MaxNoe Feb 20 '16 at 18:06
  • Are you using 32-bit Python? It has a 4 GB limit -- see [pandas-memory-error](http://stackoverflow.com/questions/23205005/pandas-memory-error/23207756#23207756) – RootTwo Feb 20 '16 at 18:57
  • I ran `import struct; print struct.calcsize("P") * 8` and it shows 64, so that is not the problem, I guess. – dsl1990 Feb 20 '16 at 19:43
  • Why is this question downvoted? If you can't answer it, at least don't downvote it. It is ridiculous to downvote a question without giving any reason; it's a legitimate question. – dsl1990 Feb 20 '16 at 20:13
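
Regarding MaxNoe's `df.groupby` suggestion, a minimal sketch (it tidies the loop but does not by itself fix the MemoryError, since the whole file is still read into memory at once):

    import pandas as pd

    data = pd.read_csv("110homes.csv")

    # groupby yields (key, sub-DataFrame) pairs, one per unique dataid,
    # replacing the manual boolean-mask filtering
    for dataid, group in data.groupby('dataid'):
        group.to_csv(str(dataid) + ".csv")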

1 Answer


Data in RAM can take a lot more space than on disk. Without seeing your 110homes.csv file it's impossible to know the details, but imagine that it consists of 10 floating point numbers per line, like: 0.0,1.0,2.0,.... In the CSV, each value takes 3 bytes plus 1 byte for the delimiter. In Python, each value takes 8 bytes (on a 64-bit machine) for the float itself, plus object overhead; if a field is kept as a string, you pay roughly 2 bytes per Unicode character, plus 8 bytes for the string length, plus 8 bytes per pointer, plus per-row overhead, and so on.

Think about it like this: on a 64-bit machine, the minimum size for a pointer, a native int, or a native float is 8 bytes. You need several of those per field, and several more per row. There's nothing unusual about data taking 15x more space in RAM than on disk.
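
You can see that overhead directly with `sys.getsizeof` (a sketch for 64-bit CPython 2.7; exact figures vary by build):

    import sys

    # the CSV field "0.0" costs 4 bytes on disk (3 chars + delimiter);
    # the boxed Python objects are far larger:
    print sys.getsizeof(0.0)    # float object: 24 bytes on 64-bit CPython
    print sys.getsizeof(u"0.0") # unicode object: roughly 60 bytes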

Do a simple test: take the first 10% of the lines of your file, and monitor the python process via `top` while it parses them. See how much RAM it uses. Does it use at least 20 GB?
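
If you'd rather not split the file by hand, here is a sketch of the same test in pandas (it assumes a pandas version that supports the `nrows` argument and `memory_usage(deep=True)`; `n_rows` is a hypothetical placeholder you must set to roughly 10% of your file's line count):

    import pandas as pd

    # hypothetical placeholder: set to ~10% of the file's total line count
    n_rows = 1000000

    sample = pd.read_csv("110homes.csv", nrows=n_rows)

    # rough in-memory footprint of the parsed sample, in bytes
    print sample.memory_usage(deep=True).sum()

Multiply that figure by 10 to estimate what the full file needs in RAM.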

SRobertJames