
I am reading a large CSV file (25 GB) into a pandas.DataFrame. My PC specifications are:

  • Intel Core i7-8700, 3.2 GHz
  • 16 GB RAM
  • Windows 10
  • DataFrame.shape = 144,000,000 rows by 13 cols
  • CSV file size on disk: 24 GB

Reading this file takes a long time, around 20 minutes. Is there any recommendation, code-wise, for doing this better?

*Note: This DataFrame is needed in its entirety, since I am going to join (merge) it with another one.
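
For reference, a minimal sketch of the kind of read being described (the file name is an assumption, since the actual code is not shown):

import pandas as pd

# single-pass read of the whole ~24 GB file into memory;
# with ~144,000,000 rows by 13 columns this is the step that takes ~20 minutes
df = pd.read_csv('filename.csv')
print(df.shape)  # roughly (144000000, 13)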

  • Look at the `chunksize` argument in `pd.read_csv()`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html – Erfan Mar 19 '19 at 17:03
  • You can try to load it in chunks as shown [here](https://stackoverflow.com/questions/39384539/how-to-read-data-in-chunks-in-python-dataframe). Also, if you can refactor your logic to process the dataframe chunk-wise, you can make the algorithm work as shown [here](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas). Maybe this will help. – Teoretic Mar 19 '19 at 17:08
  • Maybe I have to edit my question. This DF is needed in whole, since I am going to join (merge) it with another one. If I use chunks, do I have all of it at once? – Rebin Mar 19 '19 at 17:09
  • You might check out [dask](https://dask.org/). It's pretty good for these situations where the data is larger than memory. – wbadart Mar 19 '19 at 17:18
  • You should consider changing the whole design. Pandas does a great job when everything fits into memory. But when the size of the data is greater than the available memory, it has to use some form of swapping to let its data reside on disk. Joining a 25 GB dataframe on a 16 GB machine will be painfully slow, full stop. **You** know what you want to do, **I** don't, so I cannot guess whether a more efficient design is possible or not. – Serge Ballesta Mar 19 '19 at 17:21

1 Answer


You could use a dask.dataframe:

import dask.dataframe as dd  # dask's out-of-core dataframe
df = dd.read_csv('filename.csv')  # lazy read: data is only loaded when it is actually needed
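
Since the whole DataFrame is only needed for a merge, the join itself could also be expressed lazily in dask and materialized once at the end. A rough sketch, where the second file name and the join key 'id' are assumptions:

import dask.dataframe as dd  # dask's lazy, out-of-core dataframe

# both reads are lazy: nothing is loaded until compute() is called
df1 = dd.read_csv('filename.csv')
df2 = dd.read_csv('other.csv')

# build the left join lazily, then materialize the result;
# if the result will not fit in RAM, write it out with merged.to_parquet('out/')
# instead of calling compute()
merged = dd.merge(df1, df2, on='id', how='left')
result = merged.compute()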

Or you could use chunking:

import pandas as pd

def chunk_processing(chunk):  # define a function that you will apply to each chunk
    # Do something with the chunk here, then return the result
    return chunk


chunk_list = []  # empty list to hold the processed chunks
chunksize = 10 ** 6  # number of rows per chunk
for chunk in pd.read_csv('filename.csv', chunksize=chunksize):  # read the CSV in chunks of chunksize rows
    processed_chunk = chunk_processing(chunk)  # process each chunk with the chunk_processing() function
    chunk_list.append(processed_chunk)  # collect the processed chunks
df_concat = pd.concat(chunk_list)  # concatenate the list back into a single dataframe
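
If the chunk-wise processing is the join from the question, one option is to keep the other table in memory and merge each chunk against it. A rough sketch, assuming the other table fits in RAM and 'id' is the join key (both the file name and the key are assumptions):

import pandas as pd

df_other = pd.read_csv('other.csv')  # the table to join against, assumed to fit in memory

chunksize = 10 ** 6
merged_parts = []
for chunk in pd.read_csv('filename.csv', chunksize=chunksize):
    # left-join each chunk against the in-memory table
    merged_parts.append(chunk.merge(df_other, on='id', how='left'))

df_merged = pd.concat(merged_parts, ignore_index=True)  # combine the joined chunks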
  • In the case of using chunks: if DF1 (100M) is LEFT JOINed with DF2 (144M) and I split DF1 into chunks of 10M each time, I have to run the join 10 times, each time joining 10M with 144M. Do you think this improves my code? To me it looks the same as joining 100M with 144M at once. No? – Rebin Mar 19 '19 at 17:31
  • It's a bit difficult to offer optimization advice without seeing the code you are using. If the merge() is what is taking a long time, chunking probably won't speed it up that much, though it may still speed it up some. If it's the initial reading of the CSV that is taking forever, you should notice a significant improvement by using chunking. – nu11_hypothesis Mar 19 '19 at 18:22
  • The "dask" library is taking a longer time compared to pandas. I have no idea why. – Rebin Mar 21 '19 at 17:47