We can try something like this. First, create a small example CSV:
X = data.frame(id=1:1e5,matrix(runif(1e6),ncol=10))
write.csv(X,"test.csv",quote=FALSE,row.names=FALSE)
You can use the nrows argument of read.csv. Instead of providing a file name, you provide an open connection, so each call picks up where the previous one stopped, and you store the results in a list. For example:
data = vector("list",200)
con = file("test.csv","r")
data[[1]] = read.csv(con, nrows=1000)
dim(data[[1]])
COLS = colnames(data[[1]])
data[[1]] = data[[1]][,1:3]
head(data[[1]])
id X1 X2 X3
1 1 0.13870273 0.4480100 0.41655108
2 2 0.82249489 0.1227274 0.27173937
3 3 0.78684815 0.9125520 0.08783347
4 4 0.23481987 0.7643155 0.59345660
5 5 0.55759721 0.6009626 0.08112619
6 6 0.04274501 0.7234665 0.60290296
In the above, we read the first chunk, collected the column names, and subsetted the columns. If you carry on reading through the connection, the header line has already been consumed, so we need to pass header=FALSE and supply the column names ourselves:
for(i in 2:200){
  data[[i]] = read.csv(con, nrows=1000,col.names=COLS,header=FALSE)[,1:3]
}
# release the connection once all chunks have been read
close(con)
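The loop above relies on an open connection keeping its read position between calls. A minimal sketch of that behaviour, using a small temporary file (the file contents and the names tmp, first and second are made up for illustration):

```r
# Write a tiny CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,x", "1,a", "2,b", "3,c", "4,d"), tmp)

con <- file(tmp, "r")
# First call consumes the header line plus the first two data rows.
first <- read.csv(con, nrows = 2)
# Second call continues from row 3; there is no header left to read,
# so the column names are passed in explicitly.
second <- read.csv(con, nrows = 2, header = FALSE,
                   col.names = colnames(first))
close(con)
unlink(tmp)
```

Here first should hold ids 1 and 2 and second ids 3 and 4, with the same column names in both.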
Finally, we combine all of the chunks into a single data.frame:
data = do.call(rbind,data)
all.equal(data[,1:3],X[,1:3])
[1] TRUE
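You can see that I allocated a much larger list than required: it has 200 slots, but the file only fills 100 of them. This is to show that if you don't know how long the file is, you can specify more chunks than necessary and it should still work, because the extra reads past the end of the file just come back empty. This is a bit simpler than writing a while loop.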
We can wrap this into a function that takes the file, the number of rows to read at a time, the number of chunks to read, and the column names (or positions) to keep:
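read_chunkcsv=function(file,rows_to_read,ntimes,col_subset){
  # one list slot per chunk
  data = vector("list",ntimes)
  con = file(file,"r")
  # close the connection when the function exits, even on error
  on.exit(close(con))
  # the first chunk carries the header line
  data[[1]] = read.csv(con, nrows=rows_to_read)
  COLS = colnames(data[[1]])
  data[[1]] = data[[1]][,col_subset]
  # remaining chunks have no header, so reuse the column names
  for(i in seq_len(ntimes)[-1]){
    data[[i]] = read.csv(con,
      nrows=rows_to_read,col.names=COLS,header=FALSE)[,col_subset]
  }
  return(do.call(rbind,data))
}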
all.equal(X[,1:3],
read_chunkcsv("test.csv",rows_to_read=10000,ntimes=10,1:3))
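If you would rather not guess the number of chunks, the while loop mentioned earlier could look roughly like the sketch below. This is not part of the approach above: read_until_done, chunk_rows and the other names are hypothetical, and the sketch assumes that a chunk shorter than chunk_rows (or a failed or empty read) means the end of the file has been reached. Note that the tryCatch would also hide a genuine parse error in a later chunk.

```r
# Hypothetical helper: read a CSV in chunks until the file is exhausted,
# keeping only the requested columns from each chunk.
read_until_done <- function(path, chunk_rows, col_subset) {
  con <- file(path, "r")
  on.exit(close(con))  # always release the connection

  # The first chunk carries the header line.
  first <- read.csv(con, nrows = chunk_rows)
  cols <- colnames(first)
  chunks <- list(first[, col_subset, drop = FALSE])
  n_last <- nrow(first)

  # Keep reading while the previous chunk was full.
  while (n_last == chunk_rows) {
    # Treat an error or an empty result at end of file as the stop signal.
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_rows, col.names = cols, header = FALSE),
      error = function(e) NULL
    )
    n_last <- if (is.null(chunk)) 0L else nrow(chunk)
    if (n_last > 0L) {
      chunks[[length(chunks) + 1L]] <- chunk[, col_subset, drop = FALSE]
    }
  }
  do.call(rbind, chunks)
}
```

With the test.csv file from the start of this answer, a call such as read_until_done("test.csv", chunk_rows=10000, col_subset=1:3) should give the same columns as read_chunkcsv without needing ntimes. Growing the list one chunk at a time is fine here because the number of chunks is small compared with the number of rows.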