
I am trying to be really specific about my issue. I have a DataFrame with 200+ columns and 1 million+ rows. Reading it from CSV and writing it out to an Excel file takes more than 45 minutes, if I recorded it right.

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=',', na_values=('', 'nan'))
df.to_excel('data_file.xlsx', header=0, index=False)

My question: is there any way to read or write files faster with a pandas DataFrame? This is just one example file; I have many more such files.

DeshDeep Singh

3 Answers


Two thoughts:

  • Investigate Dask, which provides a Pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or clusters. It's hard to say what degree of speed-up you will get if your performance is purely IO-bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities; a minimal sketch of reading your CSV with Dask appears after this list.

  • If you are going to repeatedly read the same CSV input files, then I would suggest converting these to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(); a rough sketch of that conversion also appears below. Obviously this will only help if you can do the conversion as a one-off exercise and then use the HDF files from that point forward whenever you run your code.
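For the Dask option, here is a minimal sketch of what reading the same CSV might look like (dask.dataframe mirrors a large part of the pandas API; the len() call at the end is only an illustration, and how much you gain depends on how IO-bound your workload is):

import dask.dataframe as dd

# Dask splits the CSV into partitions and reads them in parallel
ddf = dd.read_csv("data_file.csv", na_values=['', 'nan'])

# Work is lazy until you ask for a result; len() triggers the computation
print(len(ddf))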
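And a rough sketch of the one-off CSV-to-HDF conversion (the output file name and the "data" key are just placeholders; pandas needs the PyTables package installed for HDF support):

import pandas as pd

# One-off conversion: read the CSV once and store it as HDF5
df = pd.read_csv("data_file.csv", low_memory=False, na_values=('', 'nan'))
df.to_hdf("data_file.h5", key="data", mode="w")

# From then on, load the much faster HDF copy instead of the CSV
df = pd.read_hdf("data_file.h5", key="data")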

Regards, Ian

Ian Ash

That is a big file you are working with. If you need to process the data, then you can't really get around the long read and write times.


Do NOT write to xlsx; use csv. Writing to xlsx is what takes so long. Writing to csv takes about a minute on my cheap laptop with an SSD.
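For example, with the DataFrame from the question (the output file name here is just a placeholder):

# Writing csv instead of xlsx is far faster for a frame this size
df.to_csv('data_file_out.csv', index=False)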

Frank