I want to match R's data.table::fwrite csv file writing speed in Python.
Let's check some timings. First, R:
library(data.table)
nRow=5e6
nCol=30
df=data.frame(matrix(sample.int(100,nRow*nCol,replace=TRUE),nRow,nCol))
ta=Sys.time()
fwrite(x=df,file="/home/cryo111/test2.csv")
tb=Sys.time()
tb-ta
#Time difference of 1.907027 secs
The same for Python, using pandas.to_csv:
import pandas as pd
import numpy as np
import datetime
nRow=int(5e6)
nCol=30
df = pd.DataFrame(np.random.randint(0,100,size=(nRow, nCol)))
ta=datetime.datetime.now()
df.to_csv("/home/cryo111/test.csv")
tb=datetime.datetime.now()
(tb-ta).total_seconds()
#96.421676
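
One aside: as far as I know, fwrite does not write row names, while to_csv writes the DataFrame index by default, so the Python run is formatting an extra column. Dropping it probably won't close a gap this large, but it makes the two outputs comparable:

df.to_csv("/home/cryo111/test.csv", index=False)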
There is clearly a huge performance gap here, roughly a factor of 50. One main reason might be that fwrite parallelizes the write across all available cores, whereas to_csv is probably single-threaded.
I wasn't able to find any Python packages with out-of-the-box csv file writers that could match data.table::fwrite. Have I missed something? Is there another way to speed up the write process?
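
For what it's worth, the closest thing to a workaround I can think of is formatting chunks of the frame in parallel myself. Below is a rough sketch, assuming the bottleneck is CPU-bound string formatting rather than disk I/O; parallel_to_csv and _format_chunk are just names I made up, and the cost of pickling chunks over to the worker processes may well eat most of the gain:

import numpy as np
import pandas as pd
from multiprocessing import Pool

def _format_chunk(chunk):
    # Render one horizontal slice of the frame as CSV text
    # (no header, no index) inside a worker process.
    return chunk.to_csv(header=False, index=False)

def parallel_to_csv(df, path, n_jobs=4):
    # Split the rows into n_jobs slices (ceiling division for the step).
    step = -(-len(df) // n_jobs)
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]
    # Format the slices in parallel, then write everything in one pass.
    with Pool(n_jobs) as pool:
        parts = pool.map(_format_chunk, chunks)
    with open(path, "w") as f:
        f.write(",".join(map(str, df.columns)) + "\n")  # header row
        f.writelines(parts)

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randint(0, 100, size=(10000, 30)))
    parallel_to_csv(df, "test_parallel.csv")

I would much rather use an existing, tested library than maintain something like this, hence the question.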
In both cases the file size is around 400 MB, and the code ran on the same machine.
I have tried Python 2.7, 3.4, and 3.5. I am using R 3.3.2 with data.table 1.10.4. On Python 3.4, I was using pandas 0.20.1.