Finally, I come to an issue with very slow data processing and row-binding of multiple data frames. I use a combination of lapply and dplyr for the processing. However, the process becomes very slow, since I have 20,000 rows in each data frame multiplied by 100 files in the directory. This is currently a huge bottleneck for me: even after the lapply step finishes, I don't have enough memory for the bind_rows step.
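To make this easier to reproduce, here is a sketch that writes dummy files of the same shape as mine (the values are made up, and I only include the three columns A, B, C that the code below actually uses):

# write 100 dummy CSVs of 20,000 rows each, matching the "w.*.csv" pattern
dir.create("file_directory", showWarnings = FALSE)
set.seed(1)
for (i in seq_len(100)) {
  dummy <- data.frame(A = rnorm(20000),
                      B = sample(letters, 20000, replace = TRUE),
                      C = rnorm(20000))
  write.csv(dummy, file.path("file_directory", sprintf("w_%03d.csv", i)),
            row.names = FALSE)
}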
Here is my data processing method. First, make a list of the files:
files <- list.files("file_directory", pattern = "w.*\\.csv$", recursive = TRUE, full.names = TRUE)
Then process this list of files:
library(tidyr)
library(dplyr)

data <- lapply(files, function(x) {
  read.table(file = x, sep = ",", header = TRUE, fill = FALSE, skip = 0,
             stringsAsFactors = FALSE, row.names = NULL) %>%
    select(A, B, C) %>%
    # remove = FALSE keeps B and C, which are still needed below
    unite(BC, B, C, sep = "_", remove = FALSE) %>%
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    select(BC, KK, D)
})
data <- bind_rows(data)
I am getting an error that says:

“Error: cannot allocate vector of size ... Mb”

The exact size depends on how much RAM is left. I have 8 GB of RAM, but it still isn't enough. I also tried do.call, but nothing changed! What is my friendly function or approach for this issue? I use R version 3.4.2 and dplyr 0.7.4.
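(By do.call I mean the usual base-R row-binding idiom, roughly as sketched below; like bind_rows, it has to build the whole result in memory at once.)

data <- do.call(rbind, data)  # base-R equivalent of bind_rows(data)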