I followed some earlier posts: "How to combine multiple .csv files in R?" and "Reading Many CSV Files at the Same Time in R and Combining All into one dataframe".

My purpose is basically the same: combining multiple, very large CSV files into one big matrix in R. I have the solution below, which I would like to speed up as much as possible.

Here is a fully reproducible example; my real files are bigger and far more numerous.

 setwd("C:/") #### set an easy directory to create acceptably large files
 #### this takes about 60 seconds
 for (i in 1:80) {
   print(80 - i)  # countdown of files still to write
   write.table(matrix(rnorm(20 * 3891, 0, 1), ncol = 20), col.names = FALSE, row.names = FALSE, sep = ",", file = paste(i, "file.csv", sep = ""))
 }
 #### pattern is a regular expression, so match a literal ".csv" at the end of the name
 listfiles <- list.files(path = "C:/", pattern = "\\.csv$")
 #### now the problem: this takes about 30-40 seconds; since I have bigger (and many more) files, I want to speed up this step
 library(plyr)
 mybigmatrix <- ldply(listfiles, read.csv, header = FALSE)

Thanks in advance for any help

  • [Quickly reading very large tables as dataframes in R](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r) is a good post for the reading part of your question. Then check the top answers [here](http://stackoverflow.com/search?tab=votes&q=%5br%5d%20fread%20rbindlist) for reading and combining several files using `fread` and `rbindlist`. – Henrik Nov 24 '16 at 10:53
  • Thanks for your response; I found that rbindlist(lapply(listfilenames,fread)) is VERY fast but it returns a data.table object and I cannot coerce it to a matrix. Any suggestion? – Paolo Piras Nov 24 '16 at 18:44
  • Hard to tell what you mean by "_I cannot coerce_" without a (minimal) reproducible example. `as.matrix(data.table(x = 1:2, y = 3:4))` works fine for me. – Henrik Nov 24 '16 at 18:50
  • I cannot do it when I convert the data.table in matrix. Maybe I'm wrong somewhere....I have also created a wrapper of fread that reads only numeric.........Please look at my answer below that shows an example. – Paolo Piras Nov 25 '16 at 09:57
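
The `fread` plus `rbindlist` route discussed in the comments above can be sketched as follows. This is a minimal sketch, assuming the data.table package is installed and that every file is purely numeric with no header row; the temporary directory `demo_dir` and the small file sizes are made up for illustration and are not the asker's real data.

```r
library(data.table)

# Hypothetical demo data: three small header-less numeric CSV files
# written to a temporary directory (not the asker's real files).
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:3) {
  write.table(matrix(rnorm(4 * 5), ncol = 5),
              file = file.path(demo_dir, paste0(i, "file.csv")),
              sep = ",", col.names = FALSE, row.names = FALSE)
}

# The pattern is a regular expression, so escape the dot and anchor the end.
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file with fread() and stack the pieces in a single call.
combined <- rbindlist(lapply(files, fread, header = FALSE))

# as.matrix() on an all-numeric data.table gives a numeric matrix.
big_matrix <- as.matrix(combined)
dim(big_matrix)
```

If `as.matrix()` returns character data instead, checking `sapply(combined, class)` should show which columns were not read as numbers.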

3 Answers

0

Maybe try a specialized package such as readr and its read_csv() function. Note that read_csv() takes col_names = FALSE rather than header = F:

mybigmatrix <- ldply(listfiles, readr::read_csv, col_names = FALSE)
drmariod
  • Thanks for your response; I found that rbindlist(lapply(listfilenames,fread)) is VERY fast but it returns a data.table object and I cannot coerce it to a matrix. Any suggestion? – Paolo Piras Nov 24 '16 at 18:42
0

Here is a fully reproducible example showing a problem with fread(): I cannot coerce the resulting data.table object to a numeric matrix.

 library(data.table)
 setwd("C:/") #### set an easy directory to create acceptably large files
 #### this takes a few seconds
 for (i in 1:5) {
   print(5 - i)  # countdown of files still to write
   write.table(matrix(rnorm(5 * 3891, 0, 1), nrow = 5), col.names = FALSE, row.names = FALSE, sep = ",", file = paste(i, "file.csv", sep = ""))
 }
 listfiles <- list.files(path = "C:/", pattern = "\\.csv$")

 #### wrapper around fread() that is meant to read only numeric columns
 myfread <- function(file) {
   data_frame <- fread(file, sep = ",", header = FALSE, stringsAsFactors = FALSE, select = c(1:3891), colClasses = c(rep("as.numeric", 3891)))
   data_frame
 }

 ######  this is a 25 x 3891 matrix; I want an array of 1297 x 3 x 25
 alld <- rbindlist(lapply(listfiles, myfread))
 ### why is this character??
 as.matrix(alld)
 k <- 1297
 m <- 3
 vectorr <- as.vector(t(as.matrix(alld)))
 tem <- vectorr
 n <- length(tem) / (k * m)
 tem <- array(tem, c(m, k, n))
 tem <- aperm(tem, c(2, 1, 3))
 xup <- tem #######  here I have characters
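
One possible cause of the character columns, offered as a guess rather than a confirmed diagnosis: `"as.numeric"` is the name of a function, not of a column class, whereas `"numeric"` is a class name that `fread` accepts in `colClasses`. The sketch below uses that spelling and then applies the same reshaping as the code above. It assumes data.table is installed, and it uses much smaller, made-up dimensions (`k = 4`, `m = 3`, two files of five rows each) in a temporary directory; `read_numeric` and `demo_dir` are illustrative names.

```r
library(data.table)

k <- 4   # rows of each slice (1297 in the post)
m <- 3   # columns of each slice (3 in the post)

# Hypothetical demo data: 2 files, each 5 rows by k * m columns, no header.
demo_dir <- file.path(tempdir(), "array_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (i in 1:2) {
  write.table(matrix(rnorm(5 * k * m), nrow = 5),
              file = file.path(demo_dir, paste0(i, "file.csv")),
              sep = ",", col.names = FALSE, row.names = FALSE)
}
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# "numeric" is a class name; a single value applies to every column.
read_numeric <- function(file) {
  fread(file, sep = ",", header = FALSE, colClasses = "numeric")
}

alld <- as.matrix(rbindlist(lapply(files, read_numeric)))
storage.mode(alld)  # check the storage type before reshaping

# Row i of alld is taken to hold slice i, with the m values of each of its
# k rows stored next to each other (the layout implied by the post's code).
n <- nrow(alld)
arr <- aperm(array(as.vector(t(alld)), c(m, k, n)), c(2, 1, 3))
dim(arr)
```

With this layout, `arr[j, c, i]` corresponds to `alld[i, (j - 1) * m + c]`, which is a quick way to spot-check the reshaping on real data.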
0

I think any of these options should work well for you.


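# Option 1: read every CSV file in the working directory and stack them
setwd("C:/Users/your_path_here/test")
fnames <- list.files(pattern = "\\.csv$")
csv <- lapply(fnames, read.csv)
result <- do.call(rbind, csv)

# Option 2: the same without changing the working directory
# (setwd() returns the previous directory, so keep the path in its own variable)
filedir <- "C:/Users/your_path_here/csv_files"
file_names <- dir(filedir, pattern = "\\.csv$", full.names = TRUE)
your_data_frame <- do.call(rbind, lapply(file_names, read.csv))

# Option 3: skip the first line of each file and read the rest as data
your_data_frame <- do.call(rbind, lapply(file_names, read.csv, skip = 1, header = FALSE))

# Option 4: files without a header row
your_data_frame <- do.call(rbind, lapply(file_names, read.csv, header = FALSE))

# Option 5: keep the files as a list of data frames instead of binding them
# (read.csv rather than read.delim, since read.delim expects tab-separated input)
setwd("C:/Users/Excel/Desktop/test")
temp <- list.files(pattern = "\\.csv$")
myfiles <- lapply(temp, read.csv)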

Finally, try this:

setwd("C:/Users/your_path_here/")

file_list <- list.files(pattern = "\\.csv$")

for (file in file_list) {

  # adjust sep and header to your files; the question's files are
  # comma-separated and have no header row
  temp_dataset <- read.table(file, header = FALSE, sep = ",")

  if (!exists("dataset")) {
    # if the merged dataset doesn't exist, create it
    dataset <- temp_dataset
  } else {
    # otherwise append to it (a single if/else, so the first file is not added twice)
    dataset <- rbind(dataset, temp_dataset)
  }
  rm(temp_dataset)

}
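
Because the files in the question contain only numbers and no header, another option is base R's `scan()`, which reads straight into a numeric vector without building intermediate data frames. It is not part of the options above and is shown here only as a sketch; whether it is faster on your files is something to measure. The sketch assumes every file has the same number of columns; `demo_dir`, `n_cols`, and the small file sizes are made up for illustration.

```r
# Hypothetical demo data: three header-less numeric CSV files in a temp dir.
demo_dir <- file.path(tempdir(), "scan_demo")
dir.create(demo_dir, showWarnings = FALSE)
n_cols <- 20
for (i in 1:3) {
  write.table(matrix(rnorm(6 * n_cols), ncol = n_cols),
              file = file.path(demo_dir, paste0(i, "file.csv")),
              sep = ",", col.names = FALSE, row.names = FALSE)
}
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# scan() returns the values in reading order: row 1, then row 2, and so on.
values <- lapply(files, scan, what = numeric(), sep = ",", quiet = TRUE)

# Concatenate once, then fold the vector into a matrix row by row.
big_matrix <- matrix(unlist(values), ncol = n_cols, byrow = TRUE)
dim(big_matrix)
```

The result is already a plain numeric matrix, so no conversion from a data frame or data.table is needed afterwards.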