0

I have a pretty large data set with users and their membership start and end dates. For each membership period there is one entry.

I have another dataset, which is coming from the support system, and it has records of user id's along with the dates of each system usage. This dataset is even larger, as there is one record for each usage.

I need to aggregate the second and combine with the first one, based on each user and membership period.

I tried a function for a for loop but for an extremeley large dataset (her we are talking about some few millions of rows) this will take ages.

Edit: The join or merge will not work, because here there are several ranges (between start and end dates) for each ID in the first frame. Each range has been assigned a number. (Period of membership) The second data frame has dates and IDs and the problem is finding the membership period for each ID & date by comparing it to the date ranges in the first frame.

Here is the code, along with mock datasets and what I want to achieve at the end:

ids <- c(rep("id1", 5), rep("id2", 5), rep("id3", 5))
#
stdates <- c("2015-08-01", "2016-08-01", "2017-08-01", "2018-08-01", "2019-08-01",
             "2013-05-07", "2014-05-07", "2015-05-07", "2016-05-07", "2017-05-07",
             "2011-02-13", "2013-02-13", "2015-02-13", "2016-02-13", "2017-02-13")
#
endates <- c("2016-07-31", "2017-07-31", "2018-07-31", "2019-07-31", "2020-07-31",
             "2014-05-06", "2015-05-06", "2016-05-06", "2017-05-06", "2018-05-06",
             "2013-02-12", "2015-02-12", "2016-02-12", "2017-02-12", "2018-02-12")
#
# First dataset:
df <- data.table(id = ids,
                 stdate = stdates,
                 endate = endates)
#
df <- df %>%
  arrange(id, desc(endate))
#
# Add the membership period number for each user:
setDT(df)
df[, counter := rowid(id)]
#
# Second dataset:
ids2 <- sample(df$id, 1000, replace = TRUE)
dates2 <- sample(seq(Sys.Date() - 7*365, Sys.Date() - 365,  1), 1000)

#
df2 <- data.table(id = ids2,
                  dateticket = dates2)
#
# Function
counterFunc <- function(d2, d1) {
  d2$groupCounter <- NA
  for (i in 1:nrow(d2)) {
    crdate <- d2$dateticket[i]
    idtemp <- d2$id[i]
    dtemp <- d1 %>%
      filter(id == idtemp) %>%
      data.table()
    dtemp[, drcode := ifelse(crdate >= stdate & crdate <= endate, 1, 0)]
    if (length(unique(dtemp$drcode)) == 2) {
      dtempgc <- dtemp[drcode == 1]$counter
      d2$groupCounter[i] <- dtempgc
    }
    if (length(unique(dtemp$drcode)) != 2) {
      d2$groupCounter[i] <- 0
    }
    print(i)
  }
  return(d2)
}
#
# The result I want to get without a for loop:
df2gc <- counterFunc(df2, df)
#
maop
  • 194
  • 14
  • Use a data.table non-equi join ... – Roland Aug 02 '19 at 10:26
  • Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – asachet Aug 02 '19 at 11:25
  • it is not a join problem. It is a look-up problem. – maop Aug 02 '19 at 12:12
  • You solve lookup problems with a join ... – Roland Aug 05 '19 at 06:03
  • The look-up here is for a range, and not a definite value that can be used by 'merge'. I cannot really see, how one can apply a join in this case. @Zeiram has provided an answer which works fine but not for a dataset with 5M rows. The problem is similar to an IP lookup within range of IPs. – maop Aug 12 '19 at 07:59

1 Answers1

1

The operation you want to do is called "joining", so depending on the direction and completion of that "joining" there are some options.

Here is a simple example:

df1<-data.frame("ID"=c("1","2","3","1","2"),"First_Name"=c("A","B","C","D","E"))

df2<-data.frame("ID"=c("1","2","3"),"Last_Name"=c("Ko","Lo","To"))

left_join(df1,df2,by = "ID")

The result looks like this:

 ID First_Name Last_Name 
 1          A        Ko
 2          B        Lo
 3          C        To
 1          A        Ko
 2          B        Lo

left_joinfrom the dplyrpackage simply looked up the relevant values in the look-up table (df2) and added them to the original table (df1, the left table) based on a "key" (by = "ID" in this case).

There are other operations that specify the terms of the joining more but left_joinshould be helpful in your case.

EDIT:

I have better understood your problem now. Please check if this solves it:

library(tidyverse)
df %>%
  mutate(stdate = as.Date(stdate), endate = as.Date(endate)) %>%
  left_join(df2, by = "id") %>%
  mutate(check = case_when(dateticket >= stdate & dateticket <= endate ~ "TRUE", TRUE ~ "FALSE")) %>%
  filter(check == "TRUE")

Edit:

For the problem the error "Cannot allocate vector of size" with join please refer to this: Left_join error cannot allocate vector of size

Fnguyen
  • 1,159
  • 10
  • 23
  • thanks for the answer, but join (or merge, or left_join) does not solve the problem. My problem is not joining two data frames (data tables). In the second frame, there are many entries matching the date periods of the first frame. The problem is to get the right counter for each entry of the second frame. Second frame will be merged after it is aggregated. – maop Aug 02 '19 at 12:09
  • @maop joing is performing a look-up, see my example. The left table "looks" for the right value to add from the look-up/right table based on the provided key. If it is not a look-up problem, what else do you want to do? Can you provide a simple input/output example like I did? – Fnguyen Aug 02 '19 at 12:37
  • The example data frames are in my code. The key is not enough in this case. You should find for each id and date, which counter it corresponds to, by checking which start-end date period that the date from the second table corresponds. – maop Aug 02 '19 at 12:56
  • for example, id1 has membership period from 2018-02-08 to 2019-02-17 and this is id1's 3rd membership period, hence the counter is 3. ID1 has two tickets in the other data frame, one is from 2018-03-15 and the other is from 2019-05-16. The problem is assigning a counter in the second frame, for the ticket from 2018-03-15 as 3. This date would be corresponding to another period for another ID. For each ID and date, this lookup should be done. – maop Aug 02 '19 at 13:02
  • @maop I assume the membership period vary by individual and can not be added easily as groups to the second frame? Look-up/join can act on two keys, so your real problem is just creating the second key, correct? Maybe you can edit your question a bit in that direction. Also I cannot see any mock data, so it is hard to improve my answer. – Fnguyen Aug 02 '19 at 13:22
  • correct. If you run the code I gave within my question, you will see the data. (sorry, ids and dates were somehow not copied. Added them now.) The function I wrote does what I want. But it is not efficient for large datasets. I am looking for an efficient way – maop Aug 02 '19 at 14:23
  • @maop I edited my answer, check the second code, it did work for me. – Fnguyen Aug 02 '19 at 16:06
  • hey @Zeiram yes this worked! Thanks a lot! I'll try it on my larger datasets and let you know. – maop Aug 02 '19 at 16:17
  • unfortunately it doesn't work with the actual dataset. I get 'cannot allocate a vector of size...' error, although I reduced the number of columns in both data tables. – maop Aug 02 '19 at 20:22
  • @maop Without knowing the real data there is no way to know why my function does not work but the error indicates size of the data set. This function isn't too resource heavy but if your dataset is too big there is a different problem. I suggest looking up sparse matrix functions, cleaning memory in R, etc. – Fnguyen Aug 02 '19 at 20:27
  • Your suggested code piece worked on an EC2 instance. I will accept your answer but will keep searching for an efficient way. – maop Aug 12 '19 at 13:22