
I have over 2000 *.rds files of about 18,000 obs each that I would like to merge into one data.frame. I know it's not efficient to grow objects in a for loop, but I have done so anyway, and it is taking really long as it progresses. Is there a faster way to do this?

Here are a couple of things I've tried so far:

lapply:

mergedat <- do.call(rbind, lapply(list.files("dat/"), readRDS))

for loop:

files <- list.files("dat/")
merged <- data.frame()
for (i in unique(files)) {
  df <- readRDS(paste0("dat/", i))  # read one file
  merged <- rbind(merged, df)       # copies all of merged on every iteration, which is why it slows down
}
Vedda
  • In some extremes, quoting the function might be faster (ala `do.call("rbind", ...`). In this case, though, have you tried `dplyr::bind_rows`? In my tests (using small data.frames), it runs almost an order of magnitude faster. – r2evans Jan 05 '16 at 06:18
  • @r2evans I have not tried `dplyr` option; I guess I'll give that a go too. – Vedda Jan 05 '16 at 06:19
  • `data.table::rbindlist` would also be worth checking out. – Gregor Thomas Jan 05 '16 at 06:38
  • @Gregor This was it. It merged everything in less than a minute, where `rbind` took over 20. Thanks! – Vedda Jan 05 '16 at 06:52
  • FYI, the first `do.call` is missing a trailing `)` – steveb Jan 05 '16 at 06:55
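
For reference, a minimal sketch of the `data.table::rbindlist` suggestion from the comments above, assuming every `.rds` file in `dat/` contains a data frame with the same columns:

library(data.table)

files  <- list.files("dat/", full.names = TRUE)  # full paths so readRDS can find each file
merged <- rbindlist(lapply(files, readRDS))      # read everything into a list, bind once
merged <- as.data.frame(merged)                  # drop the data.table class if a plain data.frame is needed

Reading into a list first and binding once avoids the repeated copying that growing `merged` with `rbind` inside the loop causes, which matches the speedup reported above (under a minute versus over twenty minutes).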

1 Answer


It looks like you were close with your first attempt. With a slight modification, the following worked for me:

mergedat <- do.call('rbind', lapply(list.files("dat/", full.names = TRUE), readRDS))

The only thing I added was `full.names = TRUE`, so that `readRDS` receives full paths rather than bare file names.
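
The `dplyr::bind_rows` alternative mentioned in the comments can be dropped in the same way; a sketch, under the same assumption that all files share the same columns:

library(dplyr)

files    <- list.files("dat/", full.names = TRUE)  # full paths again
mergedat <- bind_rows(lapply(files, readRDS))      # accepts a list of data frames and binds them in one pass

Either way, the key point is to read all files into a list first and bind once, rather than growing a data.frame inside a loop.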

steveb
  • OP didn't say it doesn't work. OP said it takes time to finish. –  Jan 05 '16 at 06:44
  • Thanks, but both of those I suggested work. They just take a long time. Any way to speed things up? – Vedda Jan 05 '16 at 06:44
  • Ok, got it. There was a typo in the original post, I will comment on it there. – steveb Jan 05 '16 at 06:54
  • Perhaps I should delete this answer and Gregor should move the suggestion to an "Answer" to get credit. – steveb Jan 05 '16 at 07:01
  • Whoa, this is an old post, but you can use `dplyr::bind_rows`, which is much faster. – Ahdee Sep 11 '21 at 16:01
  • @Ahdee yes, this is old; I had forgotten about this question. Since this question is considered closed because an answer exists, I don't know if it is worth updating here. In addition to your suggestion, I have been using the following to quickly load many large data files (10s of millions of rows): `my.dt <- data.table::rbindlist(lapply(fnames, function(fn) data.table::fread(fn)))`. This code has run significantly faster than what I have previously done. I believe there may also be other options that are faster. – steveb Sep 11 '21 at 16:48
  • @steveb yes, agreed, data.table or dplyr's bind_rows is so much faster. I was confused because of some weird case where I needed to stitch together >9K data frames and was surprised that after many hours it was still going. With bind_rows it took less than 30 mins. Thanks. – Ahdee Sep 11 '21 at 17:03
  • Also, the duplicate alert is not correct, since this is about speed and not so much about merging multiple tables. – Ahdee Sep 11 '21 at 17:04