15

I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to cut the time it takes to get the column into RAM by a factor of 10 compared with the standard methods.

What is my motivation? I have two files: one called Main.csv, which is 300000 rows and 500 columns, and one called Second.csv, which is 300000 rows and 5 columns. If I system.time() the command read.csv("Second.csv"), it takes 2.2 seconds. Now, if I use either of the two methods below to read the first column of Main.csv (which is 20% of the size of Second.csv, since it is 1 column instead of 5), it takes over 40 seconds -- the same amount of time it takes to read the whole 600 MB file, which is clearly unacceptable.

  • Method 1

    colClasses <- rep('NULL', 500)
    colClasses[1] <- NA
    system.time(
      read.csv("Main.csv", colClasses = colClasses)
    ) # 40+ seconds, unacceptable

  • Method 2

    read.table(pipe("cut -f1 Main.csv")) # 40+ seconds, unacceptable


How can I reduce this time? I am hoping for an R solution.

user2763361
  • 3,789
  • 11
  • 45
  • 81
  • You can load your data into a database and select only the required column, or use HDF5 files instead of csv. – zero323 Nov 02 '13 at 15:08
  • @zero323 I need something that can do `io` with all of: `Python`, `Java`, `R`. – user2763361 Nov 02 '13 at 15:10
  • `require(data.table); fread( "path/to/file/Main.csv" )` will give you an instant speed improvement. – Simon O'Hanlon Nov 02 '13 at 15:10
  • My rather [old POC package](http://cran.r-project.org/web/packages/saves/index.html) might be interesting here that provides a way to write a `data.frame` in a special binary format that can be used later for reading only a few variables at a time. Basically it's a wrapper around `save/readRDS` and writing the columns to separate files etc. More details: http://stackoverflow.com/questions/4756989/how-to-load-data-quickly-into-r – daroczig Nov 02 '13 at 15:13
  • @SimonO101 from 40 seconds to 15 seconds. Well done. – user2763361 Nov 02 '13 at 15:16
  • @SimonO101 Can you give an example how to read a single column? I used the same `colClasses` as in my example but it will just read the whole 600Mb data table (all 500 columns)? – user2763361 Nov 02 '13 at 15:20
  • 3
    Is your csv file really comma-separated? I would think that `scan(pipe("cut -f1 -d, Main.csv"))` might be worth a try. – Ben Bolker Nov 02 '13 at 15:35
  • @BenBolker Takes about 4 seconds. Might as well post the answer ('factor of 10' achieved!). – user2763361 Nov 02 '13 at 15:52
  • @user2763361 I think also `read.table( pipe("cat Main.csv | awk \"{ print $1}\"") )` would work? – Simon O'Hanlon Nov 02 '13 at 16:19
  • @SimonO101 I'll test it out later. Thank you for your continued contributions. – user2763361 Nov 02 '13 at 16:57
  • @SimonO101 : I think `read.table` will always be slower than `scan`. You can extract the first column using `awk`, `sed` (`sed -e s/,.*$//`), or `cut` -- I don't know which will be fastest but I suspect they'll be close, and `cut` is the most specialized tool for this task ... – Ben Bolker Nov 02 '13 at 17:12
  • @BenBolker thanks for the info (and thanks for pointing me to `cut` which I didn't know about before). +1 for your answer. – Simon O'Hanlon Nov 02 '13 at 20:55

2 Answers

14

I would suggest

scan(pipe("cut -f1 -d, Main.csv"))

This differs from the original proposal (read.table(pipe("cut -f1 Main.csv"))) in two ways:

  • since the file is comma-separated and cut assumes tab-separation by default, you need to add -d, to set the delimiter to a comma
  • scan() is much faster than read.table for simple/unstructured data reads.

According to the OP's comments, this takes about 4 seconds rather than 40+.
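
For concreteness, here is a minimal sketch of the full call (this assumes Main.csv has a header row and a numeric first column; adjust what for character data):

# cut extracts column 1 before the data ever reaches R,
# and scan() parses it into a plain vector with very little overhead
system.time(
  first_col <- scan(
    pipe("cut -f1 -d, Main.csv"),
    what = numeric(),  # scan() needs the target type up front
    skip = 1           # skip the header line
  )
)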

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • It is really amazing to see that you can nest Linux command lines before reading the file. I can even put my Python cleaner in the pipe command to clean the data before I read the raw file! I am wondering, is it possible to read a table from stdin line by line (each row is a line) using scan efficiently? – B.Mr.W. Nov 02 '13 at 16:53
  • Absolutely brilliant. – user2763361 Nov 02 '13 at 16:57
  • @B.Mr.W.: I'm afraid you're not going to be able to do very much better (I saw your question elsewhere, but your constraints are very strong: it's very hard to think of a way to read line-by-line in R without lots of overhead). I don't think `scan` is going to be faster than `readLines`, but why don't you try it and see how it goes? – Ben Bolker Nov 02 '13 at 17:20
  • @BenBolker Actually you can use `fread` directly with a system command, so this... `fread( "cut -f1 -d, Main.csv" )` *could be* even quicker? – Simon O'Hanlon Nov 05 '13 at 10:18
  • maybe, but `scan()` really doesn't have very much overhead (in contrast to `read.table()`) – Ben Bolker Nov 05 '13 at 13:28
11

There is a speed comparison of methods for reading large CSV files in this blog post; fread is the fastest by an order of magnitude.

As mentioned in the comments above, you can use the select argument to choose which columns to read, so:

fread("Main.csv", sep = ",", select = c("f1"))

will work.
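
For example, a sketch only -- "f1" here is a stand-in for whatever the first column of Main.csv is actually called, and selecting by position avoids guessing the header name:

library(data.table)
# select by column name, if the header is known ...
first_col <- fread("Main.csv", sep = ",", select = "f1")
# ... or by position (1L = first column)
first_col <- fread("Main.csv", sep = ",", select = 1L)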

Thronk
  • 624
  • 14
  • 37
martino
  • 308
  • 7
  • 14
  • can you select which rows to read in? i.e., select rows by columns’ conditions? an `fread` equivalent of `SELECT col_1, col_2 FROM file WHERE col_3 > 30`. – Anarcho-Chossid Aug 31 '15 at 19:44
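
On the row-filtering question above: as far as I know fread has no WHERE-style argument, so the usual workaround is to read only the needed columns and then filter with data.table syntax. A hedged sketch (the column names col_1, col_2, col_3 and file.csv are hypothetical, taken from the SQL in the comment):

library(data.table)
# read only the columns that are needed ...
dt <- fread("file.csv", select = c("col_1", "col_2", "col_3"))
# ... then filter rows and keep two columns, data.table style
result <- dt[col_3 > 30, .(col_1, col_2)]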