
I'm attempting to create a separate table from my original data that contains all of the dates between two dates, which are represented by separate columns in my original table. I have successfully done this with a loop, but I'm sure there's a more efficient way. In the example data below there are only 3 rows, but the real data set I'm working with has > 500,000, so I can't afford inefficiency.

Example:

df <- data.frame(
  id       = c('A','B','C'),
  fromDate = c('2020-01-01','2020-02-01','2020-03-05'),
  toDate   = c('2020-01-10','2020-02-03','2020-03-06')
)

#output
------------------------------
 id    fromDate      toDate   
---- ------------ ------------
 A    2020-01-01   2020-01-10 

 B    2020-02-01   2020-02-03 

 C    2020-03-05   2020-03-06 
------------------------------

#current solution

results <- data.frame(id = NULL, timespan = NULL)

for (i in 1:nrow(df)) {
  results <- rbind(
    results,
    data.frame(
      id       = df$id[i],
      timespan = seq(as.Date(df$fromDate[i]), as.Date(df$toDate[i]), by = 'days')
    )
  )
}

#results
-----------------
 id    timespan  
---- ------------
 A    2020-01-01 

 A    2020-01-02 

 A    2020-01-03 

 A    2020-01-04 

 A    2020-01-05 

 A    2020-01-06 

 A    2020-01-07 

 A    2020-01-08 

 A    2020-01-09 

 A    2020-01-10 

 B    2020-02-01 

 B    2020-02-02 

 B    2020-02-03 

 C    2020-03-05 

 C    2020-03-06 
-----------------

Any suggestions on how to speed this up for scale?
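For reference, the main cost in the loop above is the repeated rbind(): each iteration copies the entire accumulated data frame, so the total work grows quadratically with the size of the output. A fully vectorized base-R sketch of the same logic, using rep() and sequence() (variable names here are illustrative), avoids this:

from <- as.Date(df$fromDate)
to   <- as.Date(df$toDate)
len  <- as.integer(to - from) + 1L   # inclusive number of days in each range

results <- data.frame(
  id       = rep(df$id, times = len),                     # repeat each id once per day
  timespan = rep(from, times = len) + sequence(len) - 1L  # consecutive days per range
)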

Ric S
  • If efficiency is a concern, you should consider if you really need to cast the date ranges from the sparse representation in your data to a dense representation. Why are you doing this? Maybe there is an alternative approach? – Roland Mar 24 '20 at 13:44
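To illustrate the comment above: if the dense table is only needed to answer questions such as "which ids cover a given day", a data.table non-equi join can answer that directly from the sparse ranges without expanding them. A minimal sketch, with a hypothetical lookup table of days:

library(data.table)
setDT(df)
df[, c('fromDate', 'toDate') := lapply(.(fromDate, toDate), as.Date)]

# hypothetical days to look up
days <- data.table(day = as.Date(c('2020-01-05', '2020-03-06')))

# rows of df whose [fromDate, toDate] range contains each day
df[days, on = .(fromDate <= day, toDate >= day)]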

2 Answers


This will probably be rather slow for such a large number of rows, regardless of how you do it. I'd try to avoid it if possible.

Anyway, you can use package data.table for efficient "apply-by-group":

library(data.table)
setDT(df)  # convert to data.table by reference
df[, c("fromDate", "toDate") := lapply(.(fromDate, toDate), as.Date)]  # cast both columns to Date
results <- df[, seq(fromDate, toDate, by = "1 day"), by = id]          # one row per day, per id
#    id         V1
# 1:  A 2020-01-01
# 2:  A 2020-01-02
# 3:  A 2020-01-03
# 4:  A 2020-01-04
# 5:  A 2020-01-05
# 6:  A 2020-01-06
# 7:  A 2020-01-07
# 8:  A 2020-01-08
# 9:  A 2020-01-09
#10:  A 2020-01-10
#11:  B 2020-02-01
#12:  B 2020-02-02
#13:  B 2020-02-03
#14:  C 2020-03-05
#15:  C 2020-03-06
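If you prefer the result column to be called timespan as in the question (rather than the default V1), you can return a named list in j; the logic is unchanged:

results <- df[, .(timespan = seq(fromDate, toDate, by = "1 day")), by = id]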
Roland

Using dplyr, tidyr and padr, assuming the date columns are actually dates (otherwise cast them into dates first). Note that pad() should be grouped by id: without grouping it pads the full span from the overall minimum to the overall maximum date, which also fills the gaps between the ids' ranges. With group = "id" the padded rows keep their id, so no fill() step is needed:

library(dplyr)
library(tidyr)
library(padr)

df %>% 
  pivot_longer(cols = c(fromDate, toDate), values_to = "timespan") %>% 
  select(-name) %>% 
  pad(interval = "day", group = "id")

# A tibble: 15 x 2
   id    timespan  
   <chr> <date>    
 1 A     2020-01-01
 2 A     2020-01-02
 3 A     2020-01-03
 4 A     2020-01-04
 5 A     2020-01-05
 6 A     2020-01-06
 7 A     2020-01-07
 8 A     2020-01-08
 9 A     2020-01-09
10 A     2020-01-10
# ... with 5 more rows

Speed-wise, the data.table answer by @Roland is probably the better option, as your expanded data will run into millions of records.
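If you want to verify this at your own scale, a minimal benchmarking sketch (assuming the microbenchmark package is installed; the generated data is purely illustrative):

library(data.table)
library(microbenchmark)

# illustrative test set: 10,000 ids with ranges of up to 30 days each
set.seed(42)
n   <- 10000
big <- data.table(
  id       = seq_len(n),
  fromDate = as.Date('2020-01-01') + sample(0:365, n, replace = TRUE)
)
big[, toDate := fromDate + sample(1:30, n, replace = TRUE)]

microbenchmark(
  data.table = big[, seq(fromDate, toDate, by = '1 day'), by = id],
  times = 10
)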

phiver