
I'm guessing this is an easy fix, but I'm running into an issue where it's taking nearly an hour to save a pandas DataFrame to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                           , sep=','
                           , index_col=False
                           , header=0
                           , low_memory=False
                           , iterator=True
                           , chunksize=100000
                           , compression='gzip'
                           , memory_map=True
                           , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~ 100 columns) of mostly numeric (decimal) data.

Thank you,

Kimi Merroll
  • Hardware bottleneck. Keep an eye on your disk throughput, and also check for free disk space. – Kartik Nov 17 '16 at 16:46
  • As I mentioned, I did check the disk space and can copy large files to the drive with expected speed. Also, I should have mentioned I'm writing to an SSD (Samsung 950) – Kimi Merroll Nov 17 '16 at 17:47
  • Try without the `chunksize` kwarg... It could be a lot of things, like quoting, value conversion, etc. Try to [profile](https://docs.python.org/3.5/library/profile.html) it and see where it spends most of its time (a minimal profiling sketch follows this comment list). – Kartik Nov 17 '16 at 18:07
  • Any update on that? I ran into a similar problem lately – ℕʘʘḆḽḘ Apr 23 '17 at 18:50
  • I have an SSD on PCI Express and face the same issue; hardware should not be the bottleneck in this case... – PlagTag Jun 12 '17 at 15:16
  • See https://stackoverflow.com/a/54617862/6646912 for a benchmark showing how to speed up writing large DataFrames with mostly numeric data to a csv. – krassowski Feb 10 '19 at 15:52
  • As an aside, it is totally pointless and unnecessarily inefficient to use `chunksize=100000` and then simply `df = pd.concat([chunk for chunk in iter_csv])`. Just don't use the chunksize parameter then! You end up using *more* memory this way and it will definitely be slower – juanpa.arrivillaga Oct 01 '22 at 18:31
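
Following Kartik's profiling suggestion, here is a minimal sketch; the `stage` frame below is a random-number stand-in (its shape and column names are assumptions), and the to_csv arguments mirror the question:

    import cProfile

    import numpy as np
    import pandas as pd

    # Stand-in for the real data: a wide, mostly-numeric frame (assumed shape).
    stage = pd.DataFrame(np.random.rand(100000, 100),
                         columns=['col%d' % i for i in range(100)])

    # Profile the write and sort by cumulative time to see whether the time
    # goes into float formatting, quoting, or raw disk I/O.
    profiler = cProfile.Profile()
    profiler.runcall(stage.to_csv, 'output.csv', sep='|', index=False,
                     header=True, encoding='utf-8')
    profiler.print_stats('cumulative')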

5 Answers


Adding my small insight since the 'gzip' alternative did not work for me: try the to_hdf method. This reduced the write time significantly! (Less than a second for a 100 MB file, while the CSV option took 30-55 seconds.)

stage.to_hdf(r'path/file.h5', key='stage', mode='w')
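
For completeness, a small read-back sketch, assuming the same path and key as above (to_hdf needs the optional PyTables package, installable as `tables`):

    import pandas as pd

    # Load the frame back from the HDF5 store under the key it was saved with.
    stage = pd.read_hdf(r'path/file.h5', key='stage')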
Amir F
  • This solution works for me, while the .gz solution made no difference. The .to_hdf method wrote out 1.5 GB in 13 seconds; .to_csv took too long to time, even with the changes suggested by Frane – alliedtoasters Jun 25 '19 at 20:18
  • Yes, the .gz solution made no difference for a file size of 5 GB – Hardik Gupta Aug 30 '19 at 11:10
    I went from 4 minutes with .to_csv, to 8 seconds with .to_hdf !!!! Thanks @amir-f !! – Paul Oct 23 '20 at 14:07

You are reading compressed files and writing a plaintext file; that could be an I/O bottleneck.

Writing a compressed file can speed up the write by up to 10x:

    stage.to_csv('output.csv.gz'
                 , sep='|'
                 , header=True
                 , index=False
                 , chunksize=100000
                 , compression='gzip'
                 , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
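
A rough timing sketch for that experiment (the frame name `stage` comes from the question; actual timings depend heavily on the data and the disk):

    import time

    # Compare plain text against the compression codecs pandas accepts.
    for comp, ext in [(None, 'csv'), ('gzip', 'csv.gz'),
                      ('bz2', 'csv.bz2'), ('xz', 'csv.xz')]:
        start = time.time()
        stage.to_csv('output.%s' % ext, sep='|', index=False,
                     compression=comp, encoding='utf-8')
        print('%s: %.1f s' % (comp, time.time() - start))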

Frane

You said "[...] of mostly numeric (decimal) data." Do you have any columns with times and/or dates?

I saved an 8 GB CSV in seconds when it had only numeric/string values, but it took 20 minutes to save a 500 MB CSV with two date columns. So, what I would recommend is to convert each date column to a string before saving it. The following command is enough:

df['Column'] = df['Column'].astype(str) 
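
If there are several such columns, a hedged variant that converts every datetime column in one pass (alternatively, to_csv's `date_format` argument controls how dates are rendered):

    # Convert every datetime64 column to plain strings before writing.
    for col in df.select_dtypes(include=['datetime']).columns:
        df[col] = df[col].astype(str)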

I hope that this answer helps you.

P.S.: I understand that saving as a .hdf file solved the problem. But, sometimes, we do need a .csv file anyway.

lucas F

Try either Apache's Parquet file format or the polars package, which is an alternative to the usual pandas.

I was trying to cache some data locally from my server; it had 59 million rows in 9 columns, and pandas.DataFrame.to_csv simply died, so it couldn't be timed.

I put a breakpoint on the way out, saved the data using Parquet, and read it back into a polars DataFrame (the read wasn't timed, but it took roughly 5-10 seconds):

[ins] In [6]: import polars as pl
[ins] In []: pf = pl.read_parquet('path_to_my_data.parquet')

I then wrote this huge DataFrame to CSV using polars:

[ins] In [8]: %timeit pf.write_csv('path_to_my_data.csv')                                                                         
24.3 s ± 5.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

I cast the polars DataFrame to a pandas one and wrote it out using both HDF and Parquet:

[ins] In [9]: df = pf.to_pandas() 
[ins] In [11]: %timeit df.to_parquet('path_to_data2.parquet')                                                                                       
11.7 s ± 138 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [12]: %timeit df.to_hdf('path_to_my_data.h5', key="stage", mode="w")  
15.4 s ± 723 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The Parquet file was 1.8 GB whereas the h5 file was 4.3 GB. to_parquet from pandas applies compression (snappy, gzip, or brotli), and we as end users don't need to decompress it ourselves.

Either of them can be a promising, if not better, alternative if you need to deal with a huge amount of data and querying the data back and forth is a must.
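
A condensed sketch of that round trip, assuming polars and pyarrow are installed and using placeholder paths and a toy frame:

    import pandas as pd
    import polars as pl

    # Toy stand-in for the large frame.
    df = pd.DataFrame({'a': range(1000000), 'b': [0.5] * 1000000})

    # Parquet is compact and fast to write straight from pandas (needs pyarrow).
    df.to_parquet('data.parquet')

    # Handing the frame over to polars makes the CSV write much faster.
    pl.from_pandas(df).write_csv('data.csv')

    # Reading the Parquet file back into polars is also quick.
    pf = pl.read_parquet('data.parquet')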

stucash

I used to use to_csv() to output to a company network drive, which was too slow and took an hour to output a 1 GB CSV file. I just tried outputting to my laptop's C: drive with the same to_csv() statement, and it took only 2 minutes to output a 1 GB CSV file.
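
If the file ultimately has to live on the network drive, one hedged workaround is to write it to the fast local disk first and then copy the finished file across in a single sequential transfer (all paths below are placeholders):

    import shutil

    # Write the CSV to the fast local disk first...
    df.to_csv(r'C:\temp\output.csv', index=False)

    # ...then copy the finished file to the network share in one go.
    shutil.copy(r'C:\temp\output.csv', r'\\server\share\output.csv')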

Syscall