2

Using df, I am creating a new data frame (final.df) that has a row for every date between the startdate and enddate of each row in the df data frame.

# Example claims data: one row per claim with its coverage window.
claim_ids <- c("123A", "125B", "151C", "124A", "325C")
starts <- as.Date(c("2018-01-01", "2017-05-20", "2017-12-15",
                    "2017-11-05", "2018-02-06"))
ends <- as.Date(c("2018-01-06", "2017-06-21", "2018-01-02",
                  "2017-11-15", "2018-02-18"))
df <- data.frame(claimid = claim_ids,
                 startdate = starts,
                 enddate = ends)

The nested functions below are what I'm using to currently create final.df, but when looping over hundreds of thousands of claims, this method of creating final.df takes hours to run. I'm looking for alternatives that will result in the creation of final.df more efficiently.

# Expand row `a` of `df` into a day-level data frame: one row per date
# between startdate and enddate (inclusive), carrying the claimid along.
#
# Fixes over the original:
#  * data.frame() recycles `claimid` across the whole date vector, so the
#    per-day inner function and the O(n^2) rbind of 1-row frames go away.
#  * seq() already produces the full date vector in one vectorized call.
claim_level <- function(a) {
  specific_row <- df[a, ]
  dates <- seq(specific_row$startdate, specific_row$enddate, by = "days")
  data.frame(claimid = specific_row$claimid, date = dates)
}
# Build the day-level table: expand every claim, then stack the pieces.
# seq_len(nrow(df)) is safe for a zero-row df, where 1:nrow(df) would
# yield c(1, 0); the anonymous wrapper around claim_level was redundant.
final.df <- do.call("rbind", lapply(seq_len(nrow(df)), claim_level))

print(subset(final.df, claimid == "123A"))

#claimid    date
#123A       2018-01-01
#123A       2018-01-02
#123A       2018-01-03
#123A       2018-01-04
#123A       2018-01-05
#123A       2018-01-06
bshelt141
  • 1,183
  • 15
  • 31
  • 1
    If speed is your concern, you might try `library(data.table); setDT(df)[, .(date = seq(startdate, enddate, by = "day")), by = claimid]` – markus Jun 01 '18 at 18:35
  • Possible duplicate: [Expand rows by date range using start and end date](https://stackoverflow.com/questions/24803361/expand-rows-by-date-range-using-start-and-end-date#24804211) – markus Jun 01 '18 at 18:36

3 Answers

4

You can use gather from tidyr to convert wide to long format, then use pad from padr to create new date rows between start and end date. The group = "claimid" argument lets you specify grouping variables:

library(dplyr)
library(tidyr)
library(padr)

# Reshape wide (startdate/enddate columns) to long so each claim has two
# rows (its first and last day), let padr::pad() insert one row for every
# missing day within each claimid group, then drop the marker column.
df %>%
  gather(var, date, -claimid) %>%
  pad(group = "claimid") %>%
  select(-var)

Or with data.table for efficiency:

library(data.table)
# Grouped j-expression: for each claimid, seq() builds the full daily date
# range in one vectorized call; data.table stacks the per-group results.
# NOTE(review): setDT() converts `df` to a data.table by reference.
setDT(df)[,.(date = seq(startdate, enddate, "days")), claimid]

Result:

   claimid       date
1     123A 2018-01-01
2     123A 2018-01-02
3     123A 2018-01-03
4     123A 2018-01-04
5     123A 2018-01-05
6     123A 2018-01-06
7     124A 2017-11-05
8     124A 2017-11-06
9     124A 2017-11-07
10    124A 2017-11-08
11    124A 2017-11-09
12    124A 2017-11-10
13    124A 2017-11-11
14    124A 2017-11-12
15    124A 2017-11-13
16    124A 2017-11-14
17    124A 2017-11-15
18    125B 2017-05-20
19    125B 2017-05-21
20    125B 2017-05-22
...

Benchmarks:

Initialize functions:

library(tidyverse)
library(padr)
library(data.table)

# OP's function
# Expands row `a` of `df` into one row per day. Kept byte-identical as the
# benchmark baseline: each day becomes a 1-row data.frame which is then
# rbind()-ed, and that repeated growth is the O(n^2) hot spot being measured.
claim_level <- function(a) {
  specific_row <- df[a, ]
  dates <- seq(specific_row$startdate, specific_row$enddate, by="days")
  day_level <- function(b) {
    day <- dates[b]
    data.frame(claimid = specific_row$claimid, date = day)
  }
  do.call("rbind", lapply(c(1:length(dates)), function(b) day_level(b))) 
}

# Stacks the per-claim expansions for every row of df (OP's baseline,
# wrapped in a function so microbenchmark can time it).
OP_f = function(){
  do.call("rbind", lapply(c(1:nrow(df)), function(a) claim_level(a))) 
}

# useR's tidyverse + padr
# gather() to long format (two rows per claim: start and end day), pad()
# fills in the missing days per claimid, then the marker column is dropped.
f1 = function(){
  df %>%
    gather(var, date, -claimid) %>%
    pad(interval = "day", group = "claimid") %>%
    select(-var)
}

# useR's data.table
# NOTE(review): `DT = df` does not copy, so setDT() presumably converts
# `df` by reference as well — confirm if `df` must stay a data.frame.
DT = df
setDT(DT)

# Grouped j-expression: one seq() call per claimid builds its date range.
f2 = function(){
  DT[,.(date = seq(startdate, enddate, "days")), claimid]
}

# Moody_Mudskipper's Base R: walk the three columns in parallel, build one
# expanded data.frame per claim, then stack them. startdate:enddate works
# because Dates are numeric underneath, hence the origin in as.Date().
f3 = function(){
  expand_claim <- function(claimid, startdate, enddate) {
    data.frame(claimid,
               date = as.Date(startdate:enddate, origin = "1970-01-01"))
  }
  pieces <- mapply(expand_claim,
                   df$claimid, df$startdate, df$enddate,
                   SIMPLIFY = FALSE)
  do.call(rbind, pieces)
}

# Moody_Mudskipper's tidyverse
# Per claim, store the full date sequence as a list-column, keep only
# claimid plus that column, then unnest() to get one row per day.
f4 = function(){
  df %>% 
    group_by(claimid) %>% 
    mutate(date = list(as.Date(startdate:enddate, origin = "1970-01-01"))) %>%
    select(1, 4) %>% 
    unnest %>%
    ungroup
}

# MKR's tidyr expand
# expand() generates the full daily seq() of dates within each claimid group.
f5 = function(){
  df %>% 
    group_by(claimid) %>%
    expand(date = seq(startdate, enddate, by="day"))
}

Check if identical:

> identical(OP_f() %>% arrange(claimid), data.frame(f1()))
[1] TRUE
> identical(OP_f(), data.frame(f2()))
[1] TRUE
> identical(OP_f(), data.frame(f3()))
[1] TRUE
> identical(OP_f(), data.frame(f4()))
[1] TRUE
> identical(OP_f() %>% arrange(claimid), data.frame(f5()))
[1] TRUE

Benchmark Results:

library(microbenchmark)
microbenchmark(OP_f(), f1(), f2(), f3(), f4(), f5())

Unit: milliseconds
   expr       min        lq      mean    median        uq        max neval
 OP_f() 26.421534 27.697194 30.342682 28.981143 31.537396  58.071238   100
   f1() 36.133364 38.179196 40.749812 39.870931 41.367655  58.428888   100
   f2()  1.005843  1.261449  1.450633  1.383232  1.559689   4.058900   100
   f3()  2.373679  2.534148  2.786888  2.633035  2.797452   6.941421   100
   f4() 22.659097 23.341435 25.275457 24.111411 26.499893  40.840061   100
   f5() 46.445622 48.148606 52.565480 51.185478 52.845829 176.912276   100

data.table is the winner in terms of speed, and @Moody_Mudskipper's Base R solution is the second best. Although padr::pad and tidyr::expand seem to be the most convenient, they are also the slowest (even slower than OP's original program).

acylam
  • 18,231
  • 5
  • 36
  • 45
  • Probably performance of `tidyr::expand` is worth considering as an option. – MKR Jun 01 '18 at 19:51
  • 1
    @MKR Added it to benchmarks. Unfortunately, it seems to be even slower than `padr`. It's convenient though if you've only loaded `tidyverse` – acylam Jun 01 '18 at 19:59
  • 1
    Thanks @useR. At least we got info about performance of it:-). I find it easy to use `expand`. – MKR Jun 01 '18 at 20:01
2

In base R:

# For each claim (row), build a small data.frame covering its date range,
# then stack them all. startdate:enddate works because Dates are numeric
# underneath, hence the origin when converting back with as.Date().
do.call(rbind,
Map(function(claimid, startdate, enddate)
  data.frame(claimid, date=as.Date(startdate:enddate, origin = "1970-01-01")),
    df$claimid, df$startdate, df$enddate))

# claimid       date
# 1    123A 2018-01-01
# 2    123A 2018-01-02
# 3    123A 2018-01-03
# 4    123A 2018-01-04
# 5    123A 2018-01-05
# 6    123A 2018-01-06
#...

And using only tidyverse:

library(tidyverse) # for `dplyr` and `tidyr`
# Per claim, store the full date sequence as a list-column, keep only
# claimid plus that column (positions 1 and 4), then unnest() to one
# row per day.
df %>% 
  group_by(claimid) %>% 
  mutate(dates = list(as.Date(startdate:enddate, origin = "1970-01-01"))) %>%
  select(1, 4) %>% 
  unnest %>%
  ungroup

# # A tibble: 82 x 2
#   claimid      dates
#    <fctr>     <date>
# 1    123A 2018-01-01
# 2    123A 2018-01-02
# 3    123A 2018-01-03
# 4    123A 2018-01-04
# 5    123A 2018-01-05
# 6    123A 2018-01-06
# 7    125B 2017-05-20
# 8    125B 2017-05-21
# 9    125B 2017-05-22
# 10   125B 2017-05-23
# # ... with 72 more rows
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
2

One option is to use tidyr::expand function to expand rows between startdate to enddate.

library(tidyverse)
# Within each claimid group, expand() generates one row for every date
# produced by seq(), i.e. the claim's full daily range.
df %>% group_by(claimid) %>%
  expand(date = seq(startdate, enddate, by="day")) %>%
  as.data.frame()

#    claimid       date
# 1     123A 2018-01-01
# 2     123A 2018-01-02
# 3     123A 2018-01-03
# 4     123A 2018-01-04
# 5     123A 2018-01-05
# 6     123A 2018-01-06
# 7     124A 2017-11-05
# 8     124A 2017-11-06
# 9     124A 2017-11-07
# 10    124A 2017-11-08
# 11    124A 2017-11-09
# 12    124A 2017-11-10
#
#  70 more rows
MKR
  • 19,739
  • 4
  • 23
  • 33