1

I have a data frame where each row has a unique ID. I need to replicate each one of these rows based on the number of days between the start date and the max of the end date and the approval date.

ID <- c(1,2)

Value <- c(10,20)

StartDate <- c(as.Date("01/01/2015", '%d/%m/%Y'),
    as.Date("01/01/2015", '%d/%m/%Y'))

EndDate <- c(as.Date("31/01/2015", '%d/%m/%Y'), 
    as.Date("15/01/2015", '%d/%m/%Y'))

AppDate <- c(as.Date("15/01/2015", '%d/%m/%Y'), 
    as.Date("15/02/2015", '%d/%m/%Y'))

df <- data.frame(ID, Value, StartDate, EndDate, AppDate)

df <- df[rep(row.names(df), ifelse(as.numeric(df$AppDate) >
    as.numeric(df$EndDate),as.numeric(df$AppDate-df$StartDate),
    as.numeric(df$EndDate-df$StartDate)) + 1),]

I then need to add a sequential list of dates from the start date to the max of the end date or approval date.

I've done this via 2 loops. The outer loop loops through the data frame for each unique ID. The second loop then goes through the ID and adds the date. Once the second loop has finished it passes the row to the outer loop as the new start point.

IDs <- unique(df$ID)
df$Days <- rep(as.Date("01/01/1999",'%d/%m/%Y'), nrow(df))
counter <- 1
for (i in 1:length(IDs)) {
    ref <- IDs[i]
    start <- 1
        while (df$ID[counter] == ref) {
            ifelse(start == 1, df$Days[counter] <- df$StartDate[counter],
                df$Days[counter] <- df$StartDate[counter] + start -1)
            ifelse (counter > nrow(df), break, counter <- counter + 1)
            ifelse (counter > nrow(df), break, start <- start + 1)
        }
 }

My actual data set has over 6,000 ID's and once I've replicated the rows it ends up being over 500,000 rows. The loop took over 15 minutes to run so it's obviously very inefficient.

So I guess I have 2 questions.

1). What is the most efficient way to do this in R

2). What would be the most efficient way of doing this in general i.e. in say something like C++

thanks

MidnightDataGeek
  • 938
  • 12
  • 21
  • 1
    I think the `data.table` answer in this Q&A provides the general principle for an efficient solution: [Expanding a sequence in a data frame](http://stackoverflow.com/questions/11494511/expanding-a-sequence-in-a-data-frame?lq=1). Google "R expand date range data.table" should provide several similar examples. – Henrik Feb 01 '16 at 00:12

2 Answers2

0

Here is one solution that is vectorized. Note: Your code does not match the concept of taking the maximum of EndDate and AppDate, which I tried to do, but if that is not what you want, you can modify the code accordingly.

library(dplyr)
df <- df %>% group_by(ID) %>% mutate(Days = rep(seq(min(StartDate), max(EndDate, df$AppDate), 'days'), ceiling(nrow(df) / n()))[1:n()])

Output will be as follows (just the first few rows):

head(df)
Source: local data frame [6 x 6]
Groups: ID [1]

     ID Value  StartDate    EndDate    AppDate       Days
  (dbl) (dbl)     (date)     (date)     (date)     (date)
1     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-01
2     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-02
3     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-03
4     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-04
5     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-05
6     1    10 2015-01-01 2015-01-31 2015-01-15 2015-01-06

tail(df)
Source: local data frame [6 x 6]
Groups: ID [1]

     ID Value  StartDate    EndDate    AppDate       Days
  (dbl) (dbl)     (date)     (date)     (date)     (date)
1     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-10
2     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-11
3     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-12
4     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-13
5     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-14
6     2    20 2015-01-01 2015-01-15 2015-02-15 2015-02-15
Gopala
  • 10,363
  • 7
  • 45
  • 77
  • Thanks for that it worked perfectly. I love the simplicity of the vectorized version vs. the code that I wrote. It was a lot slower than I was expecting though, is this the quickest possible way to do it? – MidnightDataGeek Feb 01 '16 at 18:34
  • Depends on size of data, number of groups and ranges of dates you are dealing with. Also, it may help if you define 'slow.' There is a different package called `data.table` that is considered more efficient than `dplyr`. I don't use it because I find `dplyr` to be much more versatile and intuitive (personal opinion). Someone else may be able to help you covert the code to `data.table` version. – Gopala Feb 01 '16 at 22:43
0

Normally, I would recommend the cross join SQL query that returns a cartesian product (all combination between two sets). However, you can replicate the cross join in R using merge() without any by arguments and with all=True. From there, filter for EndDate cut-off:

# CALCULATE CONDITIONAL END DATE
df$TrueEndDate <- as.Date(ifelse(df$AppDate > df$EndDate,
                                 df$AppDate,
                                 df$EndDate), origin="1970-01-01")

# CREATE A SEQUENTIAL DATES DATA FRAME (HERE IS 60 DAYS FROM 2015-01-01)
dates <- data.frame(Date=as.Date(unlist(lapply(0:60, function(x) 
                                      as.Date("2015-01-01") + x)),                    
                                 origin="1970-01-01"))   

# RUN CROSS JOIN MERGE, PULLING ONLY NEEDED FIELDS
mergedf <- merge(df[c('ID', 'StartDate', 'TrueEndDate')], dates, all=TRUE)

# FILTER OUT DATES PAST ROW'S TRUE END DATE
mergedf <- mergedf[(mergedf$Date <= mergedf$TrueEndDate),]

# CLEANUP
mergedf <- mergedf[with(mergedf, order(ID)), ]     # ORDER BY ID
row.names(mergedf) <- 1:nrow(mergedf)              # RESET ROW NAMES

Should you be curious on the equivalent cross join SQL (which you can have R call on a RDMS engine and import as final data frame, may help for performance issues):

SELECT ID.ID, ID.Value, ID.StartDate, 
       CASE WHEN ID.AppDate > ID.EndDate 
            THEN ID.AppDate 
            ELSE ID.EndDate 
       END As TrueEndDate, 
Dates.Dates
FROM ID, Dates
WHERE  Dates.Dates <= CASE WHEN ID.AppDate > ID.EndDate 
                           THEN ID.AppDate ELSE ID.EndDate 
                      END
ORDER BY ID.ID, Dates.Dates
Parfait
  • 104,375
  • 17
  • 94
  • 125