
This is my code to generate files by home id. Then I will analyze each home separately.

    import numpy as np
    import pandas as pd

    data = pd.read_csv("110homes.csv")

    # write one CSV per home, keyed by dataid
    for i in np.unique(data['dataid']):
        print i
        d1 = data[data['dataid'] == i]  # rows belonging to this home
        d1.to_csv(str(i) + ".csv")

However, I am getting the error below. The machine has 200 GB of RAM, yet it still raises a MemoryError:

    data = pd.read_csv("110homes.csv")
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 474, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 260, in _read
    return parser.read()
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 721, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python2.7/site-packages/pandas/io/parsers.py", line 1170, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 769, in pandas.parser.TextReader.read (pandas/parser.c:7544)
  File "pandas/parser.pyx", line 819, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8137)
  File "pandas/parser.pyx", line 1833, in pandas.parser._concatenate_chunks (pandas/parser.c:22383)
MemoryError
dsl1990
  • Nothing to do with your MemoryError, but have a look at the `df.groupby` method; it would make the code after the read more elegant (see the sketch after these comments). – MaxNoe Feb 20 '16 at 18:06
  • Are you using 32-bit Python? It has a 4 GB limit -- see [pandas-memory-error](http://stackoverflow.com/questions/23205005/pandas-memory-error/23207756#23207756) – RootTwo Feb 20 '16 at 18:57
  • I ran `import struct; print struct.calcsize("P") * 8` and it shows 64, so that is not the problem, I guess. – dsl1990 Feb 20 '16 at 19:43
  • Why is this question downvoted? If you can't answer it, at least don't downvote it. It is ridiculous to downvote a question without giving any reason; it's a legitimate question. – dsl1990 Feb 20 '16 at 20:13
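
Regarding MaxNoe's `df.groupby` suggestion, a minimal sketch (it tidies the loop but does not by itself fix the MemoryError, since the whole file is still read into memory at once):

    import pandas as pd

    data = pd.read_csv("110homes.csv")

    # groupby yields (key, sub-DataFrame) pairs, one per unique dataid,
    # replacing the manual boolean-mask filtering
    for dataid, group in data.groupby('dataid'):
        group.to_csv(str(dataid) + ".csv")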

1 Answer


Data in RAM can take a lot more space than on disk. Without seeing your 110homes.csv file it's impossible to know the details, but imagine that it consists of 10 floating point numbers per line, like: 0.0,1.0,2.0,.... In the CSV, each value takes 3 bytes plus 1 byte for the delimiter. In Python, each value takes 8 bytes (on a 64-bit machine) for the float itself, plus object overhead; if a field is kept as a string, you pay roughly 2 bytes per Unicode character, plus 8 bytes for the string length, plus 8 bytes per pointer, plus per-row overhead, and so on.

Think about it like this: on a 64-bit machine, the minimum size for a pointer, a native int, or a native float is 8 bytes. You need several of those per field, and several more per row. There's nothing unusual about data taking 15x more space in RAM than on disk.
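
You can see that overhead directly with `sys.getsizeof` (a sketch for 64-bit CPython 2.7; exact figures vary by build):

    import sys

    # the CSV field "0.0" costs 4 bytes on disk (3 chars + delimiter);
    # the boxed Python objects are far larger:
    print sys.getsizeof(0.0)    # float object: 24 bytes on 64-bit CPython
    print sys.getsizeof(u"0.0") # unicode object: roughly 60 bytes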

Do a simple test: take the first 10% of the lines of your file, and monitor the python process via `top` while it parses them. See how much RAM it uses. Does it use at least 20 GB?
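
If you'd rather not split the file by hand, here is a sketch of the same test in pandas (it assumes a pandas version that supports the `nrows` argument and `memory_usage(deep=True)`; `n_rows` is a hypothetical placeholder you must set to roughly 10% of your file's line count):

    import pandas as pd

    # hypothetical placeholder: set to ~10% of the file's total line count
    n_rows = 1000000

    sample = pd.read_csv("110homes.csv", nrows=n_rows)

    # rough in-memory footprint of the parsed sample, in bytes
    print sample.memory_usage(deep=True).sum()

Multiply that figure by 10 to estimate what the full file needs in RAM.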

SRobertJames