I am trying to read a single column of a CSV file into R as quickly as possible. I am hoping to beat the standard methods by a factor of 10 in the time it takes to get the column into RAM.
What is my motivation? I have two files: `Main.csv`, which is 300,000 rows by 500 columns, and `Second.csv`, which is 300,000 rows by 5 columns. If I `system.time()` the command `read.csv("Second.csv")`, it takes 2.2 seconds. Yet if I use either of the two methods below to read just the first column of `Main.csv` (which should be about 20% the size of `Second.csv`, since it is 1 column instead of 5), it takes over 40 seconds. That is the same amount of time it takes to read the whole 600 MB file, which is clearly unacceptable.
Method 1
colClasses <- rep("NULL", 500)
colClasses[1] <- NA
system.time(read.csv("Main.csv", colClasses = colClasses))  # 40+ seconds, unacceptable
Method 2
read.table(pipe("cut -d, -f1 Main.csv"))  # 40+ seconds, unacceptable
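As an aside on Method 2: `cut` splits on tabs by default, so on a comma-separated file it passes each whole line through unchanged unless given the delimiter explicitly. A quick shell check (the throwaway file path is illustrative, not from the original setup):

```shell
# Tiny hypothetical 2-row, 3-column CSV to illustrate cut's delimiter behavior
printf 'a,b,c\n1,2,3\n' > /tmp/demo.csv

# Default cut splits on tabs; a CSV line has none, so it is passed through whole
cut -f1 /tmp/demo.csv      # prints "a,b,c" then "1,2,3"

# With the comma delimiter, only the first column survives
cut -d, -f1 /tmp/demo.csv  # prints "a" then "1"
```

So without `-d,`, the pipe hands `read.table` the entire file, which alone can explain why this method is no faster than reading everything.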
How can I reduce this time? I am hoping for a pure R solution.