2

I have a dataset in tibble in R like the one below:

# A tibble: 50,045 x 5
   ref_key start_date end_date  
   <chr>   <date>     <date>    
 1 123     2010-01-08 2010-01-13
 2 123     2010-01-21 2010-01-23
 3 123     2010-03-10 2010-04-14

I need to create another tibble that each row only store one date, like the one below:

   ref_key date      
   <chr>   <date>    
 1 123     2010-01-08
 2 123     2010-01-09
 3 123     2010-01-10
 4 123     2010-01-11
 5 123     2010-01-12
 6 123     2010-01-13
 7 123     2010-01-21
 8 123     2010-01-22
 9 123     2010-01-23

Currently I am writing an explicit loop for that like below:

for (loop in (1:nrow(input.df))) {
  if (loop%%100==0) {
    print(paste(loop,'/',nrow(input.df)))
  }
  temp.df.st00 <- input.df[loop,] %>% data.frame
  temp.df.st01 <- tibble(ref_key=temp.df.st00[,'ref_key'],
                    date=seq(temp.df.st00[,'start_date'],
                             temp.df.st00[,'end_date'],1))
  if (loop==1) {
    output.df <- temp.df.st01
  } else {
    output.df <- output.df %>%
      bind_rows(temp.df.st01)
  }
}

It is working, but in a slow way, given that I have >50k rows to process, it takes a few minutes to finish the loop.

I wonder if this step can be vectorized, is it something related to row_wise in dplyr?

lokheart
  • 23,743
  • 39
  • 98
  • 169
  • Related: [Expand ranges defined by “from” and “to” columns](https://stackoverflow.com/questions/11494511/expand-ranges-defined-by-from-and-to-columns); [Expand rows by date range using start and end date](https://stackoverflow.com/questions/24803361/expand-rows-by-date-range-using-start-and-end-date?noredirect=1&lq=1) – Henrik Mar 18 '18 at 08:12
  • 1
    I would go with `library(data.table) ; setDT(input.df)[, .(ref_key, seq.int(start_date, end_date, by = "day")), by = 1:nrow(input.df)]` – David Arenburg Mar 18 '18 at 08:13

2 Answers2

3

We create a row name column (rownames_to_column), then nest the 'rn' and 'ref_key', mutate by taking the sequence of 'start_date' and 'end_date' within map and unnest after selecting out the unwanted columns

library(tidyverse)
res <- df1 %>%
         rownames_to_column('rn') %>% 
         nest(-rn, -ref_key) %>%
         mutate(date = map(data, ~ seq(.x$start_date, .x$end_date, by = "1 day"))) %>%
         select(-data, -rn) %>%
         unnest
head(res, 9)
#  ref_key       date
#1     123 2010-01-08
#2     123 2010-01-09
#3     123 2010-01-10
#4     123 2010-01-11
#5     123 2010-01-12
#6     123 2010-01-13
#7     123 2010-01-21
#8     123 2010-01-22
#9     123 2010-01-23
akrun
  • 874,273
  • 37
  • 540
  • 662
1

One solution is to use tidyr::complete to expand rows. Since row expansion is based on start-date and end_date of a row, hence group_by on row_number will help to generate sequence of Date between start-date and end_date.

library(dplyr)
library(tidyr)

  df %>% #mutate(rnum = row_number()) %>%
  group_by(row_number()) %>%
  complete(start_date = seq.Date(max(start_date), max(end_date), by="day")) %>%
  fill(ref_key) %>%
  ungroup() %>%
  select(ref_key, date = start_date)



# # A tibble: 45 x 2
# ref_key date      
# <int> <date>    
# 1     123 2010-01-08
# 2     123 2010-01-09
# 3     123 2010-01-10
# 4     123 2010-01-11
# 5     123 2010-01-12
# 6     123 2010-01-13
# 7     123 2010-01-21
# 8     123 2010-01-22
# 9     123 2010-01-23
# 10     123 2010-03-10
# # ... with 35 more rows

Data

df <- read.table(text = "ref_key start_date end_date  
123     2010-01-08 2010-01-13
123     2010-01-21 2010-01-23
123     2010-03-10 2010-04-14", header = TRUE, stringsAsFactor = FALSE)
df$start_date <- as.Date(df$start_date)
df$end_date <- as.Date(df$end_date)
MKR
  • 19,739
  • 4
  • 23
  • 33