
I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:

import pandas as pd
df = pd.read_csv('data.csv')

But it takes too long. Is there a better way to speed up the dataframe creation? I was considering setting the parameter engine='c', since the documentation says:

"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."

But I don't see much gain in speed.
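
For reference, selecting the parser explicitly looks like this (a minimal sketch; 'data.csv' stands in for the actual file):

df = pd.read_csv('data.csv', engine='c')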

  • chunk it up and then do the data analysis in parts. https://stackoverflow.com/questions/44729727/pandas-slice-large-dataframe-in-chunks – Paul Brennan Jan 13 '21 at 16:53
  • Reading `csv` files is a fairly slow process. If this is a file you expect to input/output frequently, then you should pay the upfront cost of reading the csv once and save it in a format that pandas can read much more quickly: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#performance-considerations. Based on their timings, `.pkl` files can be read nearly 50x faster than .csv files (a sketch of this caching approach follows these comments) – ALollz Jan 13 '21 at 16:55
  • Do you use the same CSV many times? If so, save it in something like parquet or arrow after you've managed to get it into memory once. – tdelaney Jan 13 '21 at 16:55
  • Maybe take a look at `Dask`, which is very similar to `Pandas` but supports multicore and handles large datasets: https://docs.dask.org/en/latest/dataframe.html. Kr. – antoine Jan 13 '21 at 16:57
  • @PaulBrennan thanks, I will look into it. Seems useful – Tony Balboa Jan 13 '21 at 16:58
  • Place it in a DB instead to save on the I/O cost. How many times are you reading said file? – Umar.H Jan 13 '21 at 16:59
  • @ALollz Thanks for the approach, I am not sure if I can modify the input file, but will try it – Tony Balboa Jan 13 '21 at 17:01
  • @tdelaney just read it once, then I keep operating with it to train a model – Tony Balboa Jan 13 '21 at 17:02
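
A minimal sketch of the caching approach suggested in the comments above (the file names are placeholders, and the Parquet step assumes `pyarrow` or `fastparquet` is installed):

import os
import pandas as pd

CSV_PATH = 'data.csv'          # placeholder for the real 7GB file
CACHE_PATH = 'data.parquet'    # binary cache written after the first read

if os.path.exists(CACHE_PATH):
    # Subsequent runs: load the binary cache, much faster than re-parsing the CSV
    df = pd.read_parquet(CACHE_PATH)
else:
    # First run: pay the cost of parsing the CSV once, then cache it
    df = pd.read_csv(CSV_PATH)
    df.to_parquet(CACHE_PATH)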

1 Answer


If the problem is that you are not able to create the dataframe at all, because the file size makes the operation fail, you can check how to read it in chunks in this answer. A minimal sketch of that approach is shown below.
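
A sketch of the chunked approach ('data.csv' and the chunk size are placeholders; the idea is to filter or aggregate each chunk so the final concatenation fits in memory):

import pandas as pd

chunks = []
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):
    # do the per-chunk filtering/aggregation here, keeping only what you need
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)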

In case the dataframe does get created at some point, but you consider it too slow, then you can use datatable to read the file, convert it to pandas, and continue with your operations:

import pandas as pd 
import datatable as dt

# Read with datatable
datatable_df = dt.fread('myfile.csv')

# Then convert the frame into a pandas dataframe
pandas_df = datatable_df.to_pandas()
Ignacio Alorre
  • Thanks for the answer. But how slow is it later to convert from datatable to pandas? Because if it takes too long, the entire process may not be worth it – Tony Balboa Jan 13 '21 at 16:59
  • @TonyBalboa It actually doesn't take much longer than the reading operation. It will depend on the machine where you are running it, but the entire operation of reading + converting should take just a few seconds. – Ignacio Alorre Jan 13 '21 at 17:05
  • Thank you, actually it is much faster indeed – Tony Balboa Jan 13 '21 at 17:22