
I'm trying to read in a 13 GB CSV file using the following code:

chunks = pd.read_csv('filename.csv', chunksize=10000000)
df = pd.DataFrame()
%time df = pd.concat(chunks, ignore_index=True)

I have played with values of the chunksize parameter from 10**3 to 10**7, but every time I receive a MemoryError. The CSV file has about 3.3 million rows and 1900 columns.

I can clearly see that I have 30+ GB of memory available before I start reading the file, but I'm still getting a MemoryError. How do I fix this?

Patthebug
  • Pandas is probably not made to deal with such big files; you should try using dask and its pandas-like dataframe. – Ignacio Vergara Kausel Jun 27 '17 at 18:36
  • Are you using 64 bit `python`? – Andrew L Jun 27 '17 at 18:37
  • @AndrewL: yes, I'm using 64bit python – Patthebug Jun 27 '17 at 18:39
  • Why do you need to chunk it if it's significantly smaller than memory? Have you tried just `df=pd.read_csv('filename.csv')`? It looks like your code is needlessly assigning the data to multiple variables. – Andrew L Jun 27 '17 at 18:41
  • Possible duplicate of ["Large data" work flows using pandas](https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas) – Ignacio Vergara Kausel Jun 27 '17 at 18:41
  • @AndrewL: yes, I tried that, but kept running out of memory even then. – Patthebug Jun 27 '17 at 18:41
  • I second @IgnacioVergaraKausel comment on `dask` if you can't seem to work with it in `pandas`. – Andrew L Jun 27 '17 at 18:43
  • Simple math - even if we consider a DF where all columns are of a 1-byte dtype (bool, int8, uint8): `40*10**6*1900/1024**3` ≈ `70.78 GiB`. I can't understand how it is possible that this CSV file takes only 13 GB, as we also need space for delimiters... – MaxU - stand with Ukraine Jun 27 '17 at 18:43
  • @AndrewL: ahh, that's something totally new I'll have to deal with then. I'm not sure if the properties of dataframes will be available in dask. Also not sure how friendly dask would be for Machine Learning applications. I guess I will have to read up on it. But I secretly believe there's a solution to this problem. – Patthebug Jun 27 '17 at 18:44
  • @MaxU: you totally got me. I have about 3.3 Million rows, not 40 Million. – Patthebug Jun 27 '17 at 18:46
  • I'd go for Apache Spark + Hadoop (HDFS) if you need to have the whole DF in memory – MaxU - stand with Ukraine Jun 27 '17 at 18:46
  • what are the dtypes of your 1900 columns? – MaxU - stand with Ukraine Jun 27 '17 at 18:53
  • What's the data type of the thing you're reading? And do you really need to have the whole thing in memory? – pvg Jun 27 '17 at 18:53
  • @MaxU @pvg These are mostly dummy variables, so just 0s and 1s. Some of the other columns are integers, but we don't have any strings or dates in there. I do want the whole thing in memory (I think) coz I would like to implement a simple `RandomForestClassifier` on this dataset. P.S. This dataset is a small subset of the real data that I have – Patthebug Jun 27 '17 at 19:01
  • If __most__ (80+%) of your values are zeros, I'd consider creating a SparseDataFrame out of it... – MaxU - stand with Ukraine Jun 27 '17 at 19:03
  • You should tell pandas what the types are, especially if they are small integers, and consider what @MaxU is telling you about sparse data. The other thing is just making your subset smaller so it fits into memory. It's hard to imagine any model that gets insanely worse when you feed it 10 instead of 15 GB of data or whatever. – pvg Jun 27 '17 at 19:07
  • @MaxU: great suggestion, thank you! I will try to create a SparseDataFrame and see how it goes. – Patthebug Jun 27 '17 at 19:10

1 Answer


Chunking does nothing when you want to read everything in the file anyway. The whole purpose of chunking is to pre-process each chunk so that you then only work with the data you are interested in (possibly writing the processed chunk to disk). In addition, it appears that your chunk size of 10,000,000 is larger than the roughly 3.3 million rows in your data, meaning that you are reading the whole file in one go anyhow.
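
For illustration, here is a minimal sketch of the pattern chunking is intended for, assuming you only need a subset of the rows (`some_flag` is a made-up column name for the example):

import pandas as pd

# Keep only the rows you actually need from each chunk, so the full
# 13 GB file never has to sit in memory at once.
pieces = []
for chunk in pd.read_csv('filename.csv', chunksize=100000):
    pieces.append(chunk[chunk['some_flag'] == 1])  # 'some_flag' is hypothetical
df = pd.concat(pieces, ignore_index=True)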

As suggested by @MaxU, try sparse data frames, and also use a smaller chunk size (e.g. 100k):

chunks = pd.read_csv('filename.csv', chunksize=100000)  # add nrows=200000 to test on a subset of the file first
df = pd.concat([chunk.to_sparse(fill_value=0) for chunk in chunks])
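
Since the comments indicate that most of the 1900 columns are 0/1 dummies, telling `read_csv` their dtypes up front should help even more: at one byte per value, 3.3 million rows × 1900 columns is roughly 6 GB, which fits comfortably in 30+ GB of RAM. A rough sketch, assuming every column can safely be stored as `np.uint8` (adjust the mapping for the non-dummy columns):

import numpy as np
import pandas as pd

cols = pd.read_csv('filename.csv', nrows=0).columns    # read the header row only
dtype_map = {c: np.uint8 for c in cols}                # assumes all columns fit in uint8

chunks = pd.read_csv('filename.csv', chunksize=100000, dtype=dtype_map)
df = pd.concat(chunks, ignore_index=True)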

You may also want to consider something like GraphLab Create, which uses SFrames (not limited by RAM).
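
Dask, also mentioned in the comments, keeps a pandas-like API while reading the file lazily in partitions; a minimal sketch of what that might look like:

import dask.dataframe as dd

# Nothing is loaded until a result is actually requested, so the whole
# file never has to fit in RAM at once.
ddf = dd.read_csv('filename.csv')
print(len(ddf))                          # row count, computed out of core
sample = ddf.sample(frac=0.1).compute()  # materialise a 10% sample as a regular pandas DataFrame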

Alexander