I have a program that outputs lines of CSV data that I want to load into a data frame. I currently load the data like so:

tmpFilename <- "tmp_file"
system(paste(procName, ">", tmpFilename), wait=TRUE)
myData <- read.csv(tmpFilename) # (I also pass in colClasses and nrows for efficiency)

However, redirecting the output to a file just to read it back in seemed inefficient (the program spits out about 30 MB, so I want to handle it with optimal performance). I thought textConnection would solve this, so I tried:

con <- textConnection(system(procName, intern=TRUE))
myData <- read.csv(con)

This runs a lot slower, though: whereas the first solution degrades linearly with input size, the textConnection solution's performance seems to degrade exponentially. The slowest part is creating the textConnection; read.csv here actually completes more quickly than in the first solution, since it's reading from memory.
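
For reference, a minimal sketch of how to time each step on its own (assuming procName is the same command string as above):

system.time(output <- system(procName, intern=TRUE))  # run the program and capture its stdout as a character vector
system.time(con <- textConnection(output))            # build the text connection (the slow step here)
system.time(myData <- read.csv(con))                  # parse the CSV from memory
close(con)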

My question, then: is creating a file just to run read.csv on it my best option with respect to speed? Is there a way to speed up the creation of a textConnection? Bonus: why is creating a textConnection so slow?

Hudon
  • It appears that you're creating an extra vector containing the entire output, with `intern=TRUE`. That vector will continue to grow as output is produced, which will continually allocate memory. If you split that line into two commands, `output <- system(procname, intern=TRUE); con <- textConnection(output)`, I highly suspect that the `system` command will be taking most of the time. – Matthew Lundberg May 17 '13 at 03:15
  • right, so if I split it up, the longest statement is the `textConnection(output)` call. By "longest", I mean `system` takes a couple seconds, `read.csv` takes another couple seconds, `textConnection()` takes 4 minutes. – Hudon May 17 '13 at 03:24
  • 1
    Wow. What OS is this? If it's Linux, I would create the file under `/dev/shm` which will use nothing but RAM. – Matthew Lundberg May 17 '13 at 03:26
  • yes, Linux. However, some users will be Windows users. Ideally, I'd get a solution for both, but that's still a handy tip – Hudon May 17 '13 at 03:28
  • 1
    Possibly helpful: http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r – Ricardo Saporta May 17 '13 at 03:36
  • +1 for `data.table::fread` from the page Ricardo Saporta links to – Ben May 17 '13 at 03:44
  • @Ben Any reason you prefer `fread` to the highest-voted solution at that linked page (the one using SQL as an intermediary) ? – Carl Witthoft May 17 '13 at 11:44
  • @CarlWitthoft yes, with respect to speed, in my particular use-case `fread` is about 30x faster than SQL. It's also the fastest in @mnel's speed tests in his/her answer to that Q – Ben May 18 '13 at 05:09
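
To make the suggestions in the comments above concrete, here is a rough sketch of both ideas: writing the temporary file to the RAM-backed /dev/shm (Linux only), and letting data.table::fread run the command itself. The fread(cmd=...) form assumes a reasonably recent data.table; neither snippet has been tested against the actual procName.

# Option 1 (Linux only): put the temp file on /dev/shm, so the round trip
# through a "file" stays in RAM and never touches disk.
tmpFilename <- file.path("/dev/shm", "tmp_file")
system(paste(procName, ">", tmpFilename), wait=TRUE)
myData <- read.csv(tmpFilename)

# Option 2: let data.table::fread run the command and parse its output directly.
library(data.table)
myData <- fread(cmd = procName)  # cmd= assumes a recent data.table; older versions accepted fread(procName)

Note that fread returns a data.table, which is also a data.frame, so most downstream code should keep working.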

1 Answer

The "fastest way" will probably involve using something other than read.csv. However, sticking with read.csv, using pipe may be the way to go:

myData <- read.csv(pipe(procName))

It avoids reading the full text output into an intermediate buffer (at least before read.csv gets ahold of it).
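
For the setup in the question, that would look roughly like the following (myColClasses and myNrows are placeholders for the colClasses/nrows values already being passed, not part of the original answer):

myData <- read.csv(pipe(procName), colClasses=myColClasses, nrows=myNrows)  # same call as before, fed straight from the process's stdout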

Some timing comparisons:

> write.csv(data.frame(x=rnorm(1e5)), row.names=FALSE, file="norm.csv")
> system.time(d <- read.csv("norm.csv"))
   user  system elapsed 
  0.398   0.004   0.402 
> system.time(d <- read.csv(textConnection(system("cat norm.csv", intern=TRUE))))
   user  system elapsed 
 56.159   0.106  56.095 
> system.time(d <- read.csv(pipe("cat norm.csv")))
   user  system elapsed 
  0.475   0.012   0.531 
David F
  • Just tried this and it seems to be precisely what I was looking for (shaved off some seconds). I had no idea about `pipe`. Thank you! – Hudon Jul 01 '13 at 12:29