
Finally, I have run into an issue with very slow data processing and appending the rows of multiple data.frames. I use a combination of lapply and dplyr for the processing. However, the process becomes very slow because each data frame has 20000 rows, multiplied by 100 files in the directory.

Currently this is a huge bottleneck for me: even after the lapply step finishes, I don't have enough memory for the bind_rows step.

Here is my data processing method.

First, make a list of files:

files <- list.files("file_directory", pattern = "w.*.csv", recursive = TRUE, full.names = TRUE)

Then process this list of files:

library(tidyr)
library(dplyr)

data <- lapply(files, function(x) {
  read.table(file = x, sep = ',', header = TRUE, fill = FALSE, skip = 0,
             stringsAsFactors = FALSE, row.names = NULL) %>%
    select(A, B, C) %>%
    unite(BC, B, C, sep = '_', remove = FALSE) %>%  # keep B and C so they can still be used below
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    select(BC, KK, D)
})

data <- bind_rows(data)

I am getting an error which says:

“Error: cannot allocate vector of size ... Mb”

The exact size depends on how much RAM is left. I have 8 GB of RAM, but it still seems to struggle ;(

I also tried do.call, but nothing changed! What would be my friendly function or approach for this issue? I use R version 3.4.2 and dplyr 0.7.4.

Alexander
  • Try using data.table functions (fread, update by reference, etc.). I would also suggest reading the files without further manipulation, then rbind with an id, and then run the analysis only once, grouped by "BC" and your new file-id. – talat Oct 12 '17 at 14:16
  • @r2evans Thanks for the comments. What do you mean by `never use it`? I supposed that the `%>%` operator passes every file in the list, after reading, to the `select` function? – Alexander Oct 12 '17 at 14:19
  • Do you get an error? Warning? What indication do you have that something is not right? You need to provide more information. (You also should include any and all packages being used, likely at least `dplyr` and `tidyr`.) – r2evans Oct 12 '17 at 14:19
  • Alexander, sorry, the first `%>%` was a bit off the screen to the right when I commented ... I removed my comment when I re-read the question, my bad. – r2evans Oct 12 '17 at 14:20
  • @r2evans No problem;) I updated the last part about what error I see and the packages in use. – Alexander Oct 12 '17 at 14:23
  • Good error message. I see your "20000 rows" and "100 files", but the error suggests that you have a metric ***-load of columns. (I routinely work with 2-3 times that data with no issue, and though I have 16GB of RAM, I don't see R's memory usage spike over 5.) – r2evans Oct 12 '17 at 14:26
  • @r2evans I just confirmed that I would have 30 M rows if I merged all the files! Is there a way I can put `data.table` functions, like the `dplyr` ones, inside the `lapply`, or is there a different answer to this? – Alexander Oct 12 '17 at 14:30
  • @r2evans The problem is that most of the posts seem to bind the data.frames directly. I could not find a similar solution that binds the data after processing ;( – Alexander Oct 12 '17 at 14:31
  • Try `ldply` from the `plyr` package instead of `lapply` to directly process output into a data.frame – manotheshark Oct 12 '17 at 14:35
  • @docendodiscimus Could you send your suggestion as an answer? – Alexander Oct 12 '17 at 14:36
  • @manotheshark Could you send your suggestion as an answer with each processing line? – Alexander Oct 12 '17 at 14:37
  • Now I will search for posts, but I am sure there are only a few posts similar to this one. – Alexander Oct 12 '17 at 14:38
  • I would suggest, as @r2evans points out, that you modify your read.csv arguments to select the columns you want first. This should alleviate memory issues. A solution can be found [here](https://stats.stackexchange.com/questions/16796/reading-only-two-out-of-three-columns-with-read-csv). – jacobsg Oct 12 '17 at 14:54
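
For reference, here is a minimal sketch of the column-selection idea from the comment above. It assumes every file shares the same header and contains columns named A, B and C; passing `"NULL"` in `colClasses` makes `read.csv` skip a column entirely, so unwanted columns never take up memory:

hdr <- read.csv(files[1], nrows = 1)        # peek at the header of the first file
keep <- names(hdr) %in% c("A", "B", "C")
col_classes <- ifelse(keep, NA, "NULL")     # NA = let R guess the type, "NULL" = skip the column

data <- lapply(files, function(x) {
  read.csv(x, colClasses = col_classes, stringsAsFactors = FALSE)
})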

2 Answers


I can't test this answer since there's no reproducible data, but I guess it could be something like the following, using data.table:

library(data.table)

# Read only the needed columns from each file, keeping the file name as the list name
data <- setNames(lapply(files, function(x) {
  fread(x, select = c("A", "B", "C"))
}), basename(files))

# Bind everything once; idcol turns the list names into a file_id column
data <- rbindlist(data, use.names = TRUE, fill = TRUE, idcol = "file_id")
data[, BC := paste(B, C, sep = "_")]
data[, D := C * A]
data[, KK := median(C, na.rm = TRUE), by = .(BC, file_id)]   # grouped per file, as in the original lapply
data[, setdiff(names(data), c("BC", "KK", "D")) := NULL]     # drop everything except BC, KK and D
talat

Using ldply from the plyr package eliminates the need to bind the list after processing, as it outputs a data.frame directly:

library(plyr)   # load plyr before dplyr so plyr's mutate does not mask dplyr's grouped mutate
library(tidyr)
library(dplyr)

files <- list.files("file_directory", pattern = "w.*.csv", recursive = TRUE, full.names = TRUE)

data <- ldply(files, function(x) {
  read.table(file = x, sep = ',', header = TRUE, fill = FALSE, skip = 0,
             stringsAsFactors = FALSE, row.names = NULL) %>%
    select(A, B, C) %>%
    unite(BC, B, C, sep = '_', remove = FALSE) %>%  # keep B and C for the steps below
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    select(BC, KK, D)
})
manotheshark
  • Thanks, man. I will compare the speeds and report the results back here ASAP! – Alexander Oct 12 '17 at 14:55
  • If the files are large, `data.table` will typically run faster, but `ldply` has a `.parallel` option which offers about a 50% speed increase when reading a high number of files. – manotheshark Oct 12 '17 at 14:59
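
A minimal sketch of that `.parallel` option, assuming the `doParallel` package is installed and that a local cluster of four workers suits your machine (adjust as needed); `.paropts` passes the package list to each worker so the tidyr/dplyr calls inside the function are available there:

library(plyr)        # plyr before dplyr, as in the answer above
library(tidyr)
library(dplyr)
library(doParallel)  # provides the foreach backend used by .parallel = TRUE

cl <- makeCluster(4)       # number of workers is an assumption; match it to your CPU
registerDoParallel(cl)

data <- ldply(files, function(x) {
  read.table(file = x, sep = ',', header = TRUE, stringsAsFactors = FALSE) %>%
    select(A, B, C) %>%
    unite(BC, B, C, sep = '_', remove = FALSE) %>%
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    select(BC, KK, D)
}, .parallel = TRUE,
   .paropts = list(.packages = c("tidyr", "dplyr")))  # load these packages on each worker

stopCluster(cl)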