
I want to match R's `data.table::fwrite` CSV writing speed in Python.

Let's check some timings. First, R...

library(data.table)

nRow = 5e6
nCol = 30
# 5e6 x 30 data.frame of random integers in 1..100
df = data.frame(matrix(sample.int(100, nRow * nCol, replace = TRUE), nRow, nCol))

ta = Sys.time()
fwrite(x = df, file = "/home/cryo111/test2.csv")
tb = Sys.time()

tb - ta
#Time difference of 1.907027 secs

The same in Python, using `pandas.to_csv`:

import pandas as pd
import numpy as np
import datetime

nRow = int(5e6)
nCol = 30
# same shape as the R benchmark: 5e6 x 30 random integers
df = pd.DataFrame(np.random.randint(0, 100, size=(nRow, nCol)))

ta = datetime.datetime.now()
df.to_csv("/home/cryo111/test.csv")
tb = datetime.datetime.now()

(tb - ta).total_seconds()
#96.421676

Currently there is a huge performance gap. One main reason might be that `fwrite` parallelizes the write across all cores, whereas `to_csv` is probably single-threaded.
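To test that hypothesis, below is a minimal sketch of a parallelized write (the helper names and file paths are mine, not from any library): split the frame into contiguous row blocks, let each worker process format and write its own part file with `to_csv`, then stitch the parts together. Note that it passes `index=False` (the benchmark above writes the row index by default, which `fwrite` does not), and that pickling the chunks over to the worker processes adds overhead of its own, so this is unlikely to fully close the gap.

import os
import shutil
from multiprocessing import Pool

import numpy as np
import pandas as pd

def write_chunk(args):
    # Each worker formats and writes one contiguous slice of the frame.
    chunk, path, header = args
    chunk.to_csv(path, header=header, index=False)
    return path

def parallel_to_csv(df, path, n_jobs=4):
    # Split into contiguous row blocks; only the first part keeps the header.
    step = int(np.ceil(len(df) / float(n_jobs)))
    jobs = [(df.iloc[i:i + step], "%s.part%d" % (path, k), k == 0)
            for k, i in enumerate(range(0, len(df), step))]
    with Pool(n_jobs) as pool:  # Pool as a context manager needs Python 3
        part_paths = pool.map(write_chunk, jobs)
    # Stitch the part files together in order, then clean up.
    with open(path, "wb") as out:
        for p in part_paths:
            with open(p, "rb") as f:
                shutil.copyfileobj(f, out)
            os.remove(p)

if __name__ == "__main__":
    df = pd.DataFrame(np.random.randint(0, 100, size=(int(5e6), 30)))
    parallel_to_csv(df, "/home/cryo111/test_parallel.csv")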

I wasn't able to find any Python package with an out-of-the-box CSV writer that can match `data.table::fwrite`. Have I missed something? Is there another way to speed up the write process?

In both cases the file size is around 400 MB, and the code ran on the same machine. I have tried Python 2.7, 3.4, and 3.5. I am using R 3.3.2 and data.table 1.10.4. On Python 3.4, I was using pandas 0.20.1.

cryo111
  • Great question, this post might be worth a look: https://stackoverflow.com/q/15417574/6163621 – elPastor May 25 '17 at 12:53
  • @pshep123 Thanks for the link. I had already come across that question while researching this problem. It seems `pandas 0.11` improved on that post's timings, but I am already using `pandas 0.20`, which is still much slower than what is practically achievable (as `fwrite` shows). – cryo111 May 25 '17 at 18:23
  • Unfortunately, my question was put on hold. Anyhow, a quick update for those thinking they could call `data.table::fwrite` from Python via the `rpy2` package: it turns out this is really slow. My guess is that the Python -> R object conversion is too slow (for data around 3GB); see the sketch after these comments. – cryo111 May 28 '17 at 10:06
  • I have now moved to the `parquet` file format. One could also use `hdf5`, but an important requirement for me was that the file format be understood by `h2o`'s data import routines (as of now, `hdf5` is not supported, but `parquet` is). The respective read/write routines can be found in the `pyarrow` Python package; a minimal example follows below. That's not really an answer to my initial question, but it might help some people. – cryo111 Jul 25 '17 at 15:17
