
I am trying to be really specific about my issue. I have a DataFrame with 200+ columns and 1 million+ rows. Reading it from CSV and writing it out to an Excel file takes more than 45 minutes, if I recorded it right.

import pandas as pd

df = pd.read_csv("data_file.csv", low_memory=False, header=0, delimiter=',', na_values=('', 'nan'))
df.to_excel('data_file.xlsx', header=0, index=False)

My question: is there any way to read or write files faster with a pandas DataFrame? This is just one example file; I have many more such files.

DeshDeep Singh

3 Answers


Two thoughts:

  • Investigate Dask, which provides a Pandas-like DataFrame that can distribute processing of large datasets across multiple CPUs or clusters. It's hard to say what degree of speed-up you will get if your performance is purely IO-bound, but it is certainly worth investigating. Take a quick look at the Dask use cases to get an understanding of its capabilities; a minimal sketch of reading your CSV with Dask appears after this list.

  • If you are going to repeatedly read the same CSV input files, then I would suggest converting these to HDF, as reading HDF is orders of magnitude faster than reading the equivalent CSV file. It's as simple as reading the file into a DataFrame and then writing it back out using DataFrame.to_hdf(); a rough sketch of that conversion also appears below. Obviously this will only help if you can do the conversion as a one-off exercise and then use the HDF files from that point forward whenever you run your code.
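For the Dask option, here is a minimal sketch of what reading the same CSV might look like (dask.dataframe mirrors a large part of the pandas API; the len() call at the end is only an illustration, and how much you gain depends on how IO-bound your workload is):

import dask.dataframe as dd

# Dask splits the CSV into partitions and reads them in parallel
ddf = dd.read_csv("data_file.csv", na_values=['', 'nan'])

# Work is lazy until you ask for a result; len() triggers the computation
print(len(ddf))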
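And a rough sketch of the one-off CSV-to-HDF conversion (the output file name and the "data" key are just placeholders; pandas needs the PyTables package installed for HDF support):

import pandas as pd

# One-off conversion: read the CSV once and store it as HDF5
df = pd.read_csv("data_file.csv", low_memory=False, na_values=('', 'nan'))
df.to_hdf("data_file.h5", key="data", mode="w")

# From then on, load the much faster HDF copy instead of the CSV
df = pd.read_hdf("data_file.h5", key="data")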

Regards, Ian

Ian Ash

That is a big file you are working with. If you need to process the data, then you can't really get around the long read and write times.


Do NOT write to xlsx; use csv. Writing to xlsx is what takes so long. Writing to csv takes about a minute on my cheap laptop with an SSD.
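For example, with the DataFrame from the question (the output file name here is just a placeholder):

# Writing csv instead of xlsx is far faster for a frame this size
df.to_csv('data_file_out.csv', index=False)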

Frank