
I have a file with >5 million rows and 20 fields. I would like to open it in Pandas, but I get an out-of-memory error:

pandas.parser.CParserError: Error tokenizing data. C error: out of memory

I then read some posts on similar issues and discovered Blaze, but none of the three methods I tried (.Data, .CSV, .Table) worked.

# coding=utf-8
import pandas as pd
from pandas import DataFrame, Series
import re
import numpy as np
import sys
import blaze as bz
reload(sys)
sys.setdefaultencoding('utf-8')

# Gave an out of memory error
'''data = pd.read_csv('file.csv', header=0, encoding='utf-8', low_memory=False)
df = DataFrame(data)

print df.shape
print df.head()'''

data = bz.Data('file.csv')

# Tried the following too, but no luck
'''data = bz.CSV('file.csv')
data = bz.Table('file.csv')'''

print data
print data.head(5)

Output:

_1
_1.head(5)
[Finished in 1.0s]
KubiK888
  • A 2 GB CSV could take at least 4 GB in memory (and IIRC it needs twice that to parse), though it will depend on the columns; see http://stackoverflow.com/questions/18089667/how-to-estimate-how-much-memory-a-pandas-dataframe-will-need. Try it with the first hundred thousand / million rows and see how much RAM you're using. It may not be possible, so chunk it and dump it (to PyTables). – Andy Hayden Oct 15 '15 at 03:51
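A minimal sketch of that chunk-and-dump approach, reading the CSV in pieces and appending each piece to a PyTables-backed HDF5 store (the file name file.h5, the key 'data', and the chunk size are placeholders, not from the original post):

import pandas as pd

store = pd.HDFStore('file.h5', mode='w')  # PyTables-backed on-disk store
for chunk in pd.read_csv('file.csv', header=0, encoding='utf-8',
                         chunksize=100000):  # stream 100k rows at a time
    # note: wide string columns may need min_itemsize on the first append
    store.append('data', chunk)
store.close()

# later, read back a slice without loading the whole table into RAM
df = pd.read_hdf('file.h5', 'data', start=0, stop=1000)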

1 Answer


Blaze

For the bz.Data(...) object you'll have to actually do something to get a result. It loads the data as needed. If you were at a terminal and typed in

>>> data

you would get the head repr-ed out to the screen. If you need to use the print function, then try

bz.compute(data.head(5))
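(This matches the output above: printing a Blaze expression only shows its repr, here _1 and _1.head(5), because no computation has been run yet.)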

dask.dataframe

You might also consider using dask.dataframe, which has a similar (though subsetted) API to pandas:

>>> import dask.dataframe as dd
>>> data = dd.read_csv('file.csv', header=0, encoding='utf-8')
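
Operations on a dask.dataframe are lazy; as a rough usage sketch (the column name x here is hypothetical, not from the question):

>>> data.head()                  # a small preview is computed eagerly, like pandas
>>> data['x'].mean().compute()   # most other results need an explicit .compute()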
MRocklin