I have a pretty large data set with users and their membership start and end dates. For each membership period there is one entry.
I have another dataset, which is coming from the support system, and it has records of user id's along with the dates of each system usage. This dataset is even larger, as there is one record for each usage.
I need to aggregate the second and combine with the first one, based on each user and membership period.
I tried a function for a for loop but for an extremeley large dataset (her we are talking about some few millions of rows) this will take ages.
Edit: The join or merge will not work, because here there are several ranges (between start and end dates) for each ID in the first frame. Each range has been assigned a number. (Period of membership) The second data frame has dates and IDs and the problem is finding the membership period for each ID & date by comparing it to the date ranges in the first frame.
Here is the code, along with mock datasets and what I want to achieve at the end:
ids <- c(rep("id1", 5), rep("id2", 5), rep("id3", 5))
#
stdates <- c("2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01",
"2013-05-07", "2014-05-07", "2015-05-07", "2016-05-07", "2017-05-07",
"2011-02-13", "2013-02-13", "2015-02-13", "2016-02-13", "2017-02-13")
#
endates <- c("2016-07-31", "2017-07-31", "2018-07-31", "2019-07-31", "2020-07-31",
"2014-05-06", "2015-05-06", "2016-05-06", "2017-05-06", "2018-05-06",
"2013-02-12", "2015-02-12", "2016-02-12", "2017-02-12", "2018-02-12")
#
# First dataset:
df <- data.table(id = ids,
stdate = stdates,
endate = endates)
#
df <- df %>%
arrange(id, desc(endate))
#
# Add the membership period number for each user:
setDT(df)
df[, counter := rowid(id)]
#
# Second dataset:
ids2 <- sample(df$id, 1000, replace = TRUE)
dates2 <- sample(seq(Sys.Date() - 7*365, Sys.Date() - 365, 1), 1000)
#
df2 <- data.table(id = ids2,
dateticket = dates2)
#
# Function
counterFunc <- function(d2, d1) {
d2$groupCounter <- NA
for (i in 1:nrow(d2)) {
crdate <- d2$dateticket[i]
idtemp <- d2$id[i]
dtemp <- d1 %>%
filter(id == idtemp) %>%
data.table()
dtemp[, drcode := ifelse(crdate >= stdate & crdate <= endate, 1, 0)]
if (length(unique(dtemp$drcode)) == 2) {
dtempgc <- dtemp[drcode == 1]$counter
d2$groupCounter[i] <- dtempgc
}
if (length(unique(dtemp$drcode)) != 2) {
d2$groupCounter[i] <- 0
}
print(i)
}
return(d2)
}
#
# The result I want to get without a for loop:
df2gc <- counterFunc(df2, df)
#