Quickest way to get unique months, "other value" combinations using date ranges

Question

I have a data frame of start and end dates for 40k~ different doctors (MPINs). We can create an illustrative data frame using the code below:

x <- seq(as.Date("2014-01-01"), as.Date("2015-10-31"), by = "days")
y <- c(1:150)
mpins <- c(1000000:9999999)
s = 40000

df <- data.frame(start_date = as.Date(sample(x, s, replace = TRUE)),
             MPIN = sample(mpins, s, replace = FALSE))
df$end_date <- as.Date(df$start_date + sample(y, s, replace = TRUE))

head(df)
  start_date    MPIN   end_date
1 2015-10-31 1093782 2016-03-27
2 2014-12-06 4932382 2015-04-30
3 2015-02-28 5577980 2015-03-29
4 2014-07-31 9824869 2014-11-17
5 2014-06-14 5845130 2014-06-22
6 2014-07-13 2773056 2014-10-17

What I'd like to now do is find a very quick way to create a new data frame that contains a field for each unique month (formatted "YYYY-mm") for each unique MPIN in a data frame that looks like this:

head(target_df)
     months    MPIN
1   2015-10 1093782
2   2015-11 1093782
3   2015-12 1093782
4   2016-01 1093782
5   2016-02 1093782
6   2016-03 1093782

I've come up with a looping function to do this, but I feel that it is very inefficient. With 40k~ MPINs, it takes about 2 minutes, and the list of MPINs is only going to get larger with time. Below is my current-state solution:

df2 <- function(x) {
  tm1 <- df[x, ]
  dates <- data.frame(seq(as.Date(tm1$start_date), 
                      as.Date(tm1$end_date),
                      by = "days"))
  colnames(dates) <- c("dates")
  dates$months <- substr(as.character(dates$dates), 1, 7)
  dates <- dates[which(!duplicated(dates$months)), ]
  dates$MPIN <- tm1$MPIN
  dates$dates <- NULL
  print(dates)
}

a <- (1:nrow(df))
system.time(df3 <- do.call("rbind", lapply(a, function(x) df2(x))))

df3$unique <- paste0(df3$MPIN, "-", df3$months)
df3 <- df3[which(!duplicated(df3$unique)), ]
df3$unique <- NULL

head(df3)
     months    MPIN
1   2015-10 1093782
2   2015-11 1093782
32  2015-12 1093782
63  2016-01 1093782
94  2016-02 1093782
123 2016-03 1093782

Any suggestions to speed this process up would be greatly appreciated. Thanks!

Update

Slightly tweaking @Michele_Usuelli 's helpful recommendation, I was able to speed up the process by about 80%.

My original function's results:

user  system elapsed 
122.57    1.50  126.01

The results using the function below:

user  system elapsed 
 25.52    0.15   26.06 

library(data.table)
library(dplyr)

# for each record, create a sequence of dates
df <- data.table(df)
df4 <- df[, list(date = seq(start_date, end_date, by = "day"),
             MPIN = MPIN),
      by = 1:nrow(df)]

# determine the unique month-MPIN combinations
df5 <- df4 %>%
  group_by(month = paste0(format(date, "%Y"), "-", format(date, "%m")), MPIN) %>%
  summarise(n = n()) %>%
  select(-n)

are the `months`, all the `months` from the start to the end, for each MPIN? — SymbolixAU, May 02 '16 at 23:07

score 1 · Accepted Answer · answered May 02 '16 at 23:37

1

This should be much faster:

library(data.table)
library(dplyr)

# for each record, create a sequence of dates
df <- data.table(df)
df4 <- df[, list(date = seq(start_date, end_date, by = "day"),
                 MPIN = MPIN),
          by = 1:nrow(df)]

# determine the unique month-MPIN combinations
df5 <- df4 %>%
  group_by(month = format(date, "%m"), MPIN) %>%
  summarise(n = n())

Is the result the same?

answered May 02 '16 at 23:37

Michele Usuelli

1,970
13
15

1

Is there a reason to sequence the dates by "day" rather then "month" I would think it would be even faster just to add a check to see if the last month needed to be added manually (if the end day of the month is < the start day of the month) – gtwebb May 02 '16 at 23:42
I tweaked it a little, but this process is working much faster. Results and final dplyr function included as an update in this question's original post. Thank you! – bshelt141 May 03 '16 at 09:31

score 0 · Answer 2 · edited May 23 '17 at 12:23

0

You should be able to achieve this with rep function, here is a quick example:

monnb <- function(d) { lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
                       lt$year*12 + lt$mon }
df$length <- monnb(df$end_date) - monnb(df$start_date)
res <- df[rep(row.names(df), times=df$length), c("start_date", "MPIN")]

To get the month for each record, you can do running count on the rows within the MPIN and add it to the start date.

edited May 23 '17 at 12:23

Community

1
1

answered May 02 '16 at 23:45

Bulat

6,869
1
29
52

I tried this approach, but the res function still produces a start_date instead of a unique month field. – bshelt141 May 03 '16 at 09:23
you can do running count (e.g. `cumsum`) on the rows within the `MPIN` and add it to the start date. – Bulat May 03 '16 at 10:41

Quickest way to get unique months, "other value" combinations using date ranges

2 Answers2