4

I'm looking for a way of reading only a few columns from a csv file into R using shell() or pipe.

I found this thread that explains how to accomplish that on Linux: Quicker way to read single column of CSV file

On Linux this works adding the what argument:

a <-as.data.frame(scan(pipe("cut -f1,2 -d, Main.csv"),
                       what=list("character","character"),sep= ","))

However this doesn't seem to work on Windows.

When using pipe("cut -f1 -d, Main.csv") the connection gets opened but it doesn't return anything.

What would be the functions/syntax I need to use in order to make this work on Windows.

Is is possible to accomplish this by using shell()?

Thanks,

Diego

Community
  • 1
  • 1
Diego
  • 824
  • 2
  • 9
  • 18

2 Answers2

6

Make sure that cut is on your path - its in Rtools. This works for me:

# check that cut is availble
Sys.which("cut")

# create test data
Lines <- "a,b,c
1,2,3
4,5,6"
cat(Lines, file = "in.csv")

# read it
DF <- read.csv(pipe("cut -f1,2 -d, in.csv"))

Added

Rtools is now Rtools40 and cut is at C:\Rtools40\usr\bin\cut.exe .

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Thank you G. I think I don't have installed RTools in my windows because it uses Cygwin and I had troubles in the past with Ruby's active record library. I'll have to double-check that when I get back to my Windows machine, but if that works for you probably that's the problem. Thanks again. – Diego May 30 '14 at 22:03
  • 2
    You can install Rtools without putting it on your path. In that case its unlikely there would be any conflicts. Once installed that way try: `DF <- read.csv(pipe("\\Rtools\\bin\\cut -f1,2 -d, in.csv"))` – G. Grothendieck May 31 '14 at 00:56
  • Nice G. It works well. Below you can see the results of my tests. For a reason I don't understand yet, the number of rows is not accurate with most of the implementations. Only fread() was accurate and faster than the default read.csv(). Do you know the reason, or how I can improve that behaviour? – Diego May 31 '14 at 11:20
  • Try cutting down the example data until you get it to produce a not-intended number of rows with only a few rows of input. – G. Grothendieck May 31 '14 at 12:57
  • Thank you G. Not sure if I'm going to explore this, but certainly if I do it I'll post back. Thanks again – Diego Jun 01 '14 at 17:06
2
> system.time(a <- read.csv("in.csv"))
   user  system elapsed 
   1.24    0.04    1.26 
> dim(a)
[1] 4706   46
> system.time(b <-read.csv(pipe("C:/Rtools/bin/cut -f1,2 -d, in.csv")))
   user  system elapsed 
   0.22    1.27    2.37 
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string
> dim(b)
[1] 2726    2
> system.time(d <-as.data.frame(scan(pipe("C:/Rtools/bin/cut -f1,2 -d, in.csv"),
+                        what=list("character","character"),sep= ",")))
Read 1715 records
   user  system elapsed 
   0.31    1.19    2.47 
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string
> dim(d)
[1] 1715    2
> library(data.table)
data.table 1.9.2  For help type: help("data.table")
Warning message:
closing unused connection 3 (C:\Windows\system32\cmd.exe /c C:/Rtools/bin/cut -f1,2 -d, in.csv) 
> system.time(e <-fread("C:/Rtools/bin/cut -f1,2 -d, in.csv"))
   user  system elapsed 
   0.02    0.01    0.80 
> dim(e)
[1] 4706    2
Diego
  • 824
  • 2
  • 9
  • 18
  • It is more clean to use library(rbenchmark). Then you can show all functions at a time. Example: benchmark(f1(x), f2(x), f3(x), f4(x), columns = c("test", "replications", "elapsed", "relative"), order = "relative", relative = "elapsed", replications = 10). (In this case substitue f1(x) by fread("C:/Rtools/bin/cut -f1,2 -d, in.csv")), and so on. Anyway, data.table wins. +1 – luis_js May 31 '14 at 22:47
  • Thank you Luis. Didn't know that library! – Diego Jun 01 '14 at 17:03