0

I was wondering if anyone could help me with the following:

i have two data sets: (1) one containing an id and an order_date (2) the second containing of that same id and delivery_dates of emails

I want to count the number of emails a person receives before the order_date. However, I cannot manage to do so. When I merge both data files, the order_dates are coupled with the delivery dates, and that is not what I want. Also, I do not want to count all the delivery dates for one person, since it needs to be time dependent.

I hope someone could help me!!

example dataset 1:

id.  order_date age
xx3  2014/07/04 72
xx3  2014/10/08 72
xx3  2014/11/12 72
xx7  2014/05/02 34
xx7  2014/07/09 34
xx9  2014/12/22 55

example dataset 2:

id. delivery_date 
xx3 2014/07/02
xx3 2014/08/10
xx3 2014/11/02
xx3 2014/07/02
xx3 2014/12/02
xx3 2014/12/11
xx7 2014/07/05

what i would want:

id. frequency_received order_date
xx3 1                  2014/07/04
xx3 3                  2014/10/08

The dates are in YYYYMMDD format.

Jaap
  • 81,064
  • 34
  • 182
  • 193
Mark
  • 21
  • 4
  • 2
    Please give us a data example, so we can try (see `dput()`) for instance. Share with SO the code you've tried. Be more clear in your question. Here in "what I want" you have `order_date`: `03/11/2014`, but this date is part of the `delivery_date`?!? – timat Jan 06 '17 at 14:06
  • The date in "what I want" was just an example, I have changed it. What do you mean by dput()? – Mark Jan 06 '17 at 14:12
  • @timat I have tried to merge the files like this: 'new_data <- merge(x=orders, y=emails, by="id", all.x=TRUE)' However, when I do that, all the emails end up in there, and it copies all the order dates, thus making computations after the merge invalid. – Mark Jan 06 '17 at 14:15
  • Are those USA dates with MM/DD/YYYY or non-US with DD/MM/YYYY? As of now, these dates are not in datetime formats. – Parfait Jan 06 '17 at 14:15
  • please read this: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example, use SO formatting, and update your response instead of writing in the comment section – timat Jan 06 '17 at 14:17

2 Answers2

2

Conditional merging hasn't been implemented in dplyr yet, you could use data.table, you could also use sqldf like this:

library(sqldf)
library(lubridate)
library(dplyr)

options(sqldf.driver = "SQLite")

tblA <- data.frame(id = c('xx3', 'xx3', 'xx3', 'xx7', 'xx7', 'xx9'),
                   order_date = c('2014/07/04', '2014/10/08', '2014/11/12',
                                  '2014/05/02', '2014/07/09', '2014/12/22'),
                   age = c(72, 72, 72, 34, 34, 55),
                   stringsAsFactors = FALSE)

tblB <- data.frame(id = c('xx3', 'xx3', 'xx3', 'xx3', 'xx3', 'xx3', 'xx7'),
                   delivery_date = c('2014/07/02', '2014/08/10', '2014/11/02',
                                     '2014/07/02', '2014/12/02', '2014/1211',
                                     '2014/07/05'),
                   stringsAsFactors = FALSE)

tblA$order_date <- ymd(tblA$order_date)
tblB$delivery_date <- ymd(tblB$delivery_date)

tblC <- sqldf("select tblA.id, order_date, delivery_date
               from tblA
               join tblB
                 on tblA.id = tblB.id
                 and tblA.order_date >= tblB.delivery_date")

tblC
answer <- tblC %>%
    group_by(id, order_date) %>%
    summarise(frequency_received = n())

as.data.frame(answer)

This gives:

   id order_date frequency_received
1 xx3 2014-07-04                  2
2 xx3 2014-10-08                  3
3 xx3 2014-11-12                  4
4 xx7 2014-07-09                  1
A. Lester
  • 100
  • 8
  • Thank you! This was the solution! @A.Lester – Mark Jan 06 '17 at 14:44
  • Glad to help! if you haven't already, please hit the grey/green tick to accept the answer. – A. Lester Jan 06 '17 at 14:48
  • This could be all combined into a single SQL statement: `sqldf("select tblA.id, order_date, count(*) freq from tblA join tblB on tblA.id = tblB.id and tblA.order_date >= tblB.delivery_date group by tblA.id, order_date")` – G. Grothendieck Jan 06 '17 at 15:08
  • Yeah it's cleaner and easier your way. I try not to use SQL in R more than strictly required. Personal preference. – A. Lester Jan 06 '17 at 15:36
  • I'm sorry about that. I got over-excited. Thank you for pointing it out to me. I'm really just looking to get my reputation higher, and the OP seemed very satisfied that I got the right answer for him. – A. Lester Jan 06 '17 at 16:29
  • @A.Lester, how can I make sure that all the other variables in tblA will be restored? When run this line, all the other variables from tblA are skipped? – Mark Jan 07 '17 at 20:22
  • @G.Grothendieck, how can I make sure that all the other variables in tblA will be restored? When run this line, all the other variables from tblA are skipped? (I couldn't add you both in the same comment)) – Mark Jan 07 '17 at 20:23
  • `select tblA.*, count(*) freq from ...` – G. Grothendieck Jan 07 '17 at 20:55
1

A possible solution would be to use the foverlaps-function from the data.table-package:

library(data.table)
# convert the 'data.frame's to 'data.table's with setDT()
setDT(ds1)
setDT(ds2)

# create a reference dataset with the minimum dates for each id
md <- ds2[, min(delivery_date), id
          ][ds1[, min(order_date), id], on = 'id'
            ][is.na(V1), V1 := i.V1
              ][, mindate := pmin(V1, i.V1)
                ][, .(id, mindate)]

# create a start date for the time window in which the emails should have been sent
ds1[, bdate := shift(order_date, fill = min(md$mindate[match(id,md$id)])-1), by = id]
# create 2nd deliverydate needed for the foverlaps function
ds2[, delivery_date2 := delivery_date]

# set the keys for each 'data.table'
setkey(ds1, id, bdate, order_date)
setkey(ds2, id, delivery_date, delivery_date2)

# perform the overlap join & calculate the number of recieved emails (freq)
foverlaps(ds1, ds2, nomatch = 0)[, .(freq = .N), by = .(id, order_date)][, freq := cumsum(freq), by = id][]

gives:

    id order_date freq
1: xx3 2014-07-04    2
2: xx3 2014-10-08    3
3: xx3 2014-11-12    4
4: xx7 2014-07-09    1

Used data:

ds1 <- structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("xx3", "xx7", "xx9"), class = "factor"), 
                      order_date = structure(c(16255, 16351, 16386, 16192, 16260, 16426), class = "Date"), 
                      age = c(72L, 72L, 72L, 34L, 34L, 55L)), 
                 .Names = c("id", "order_date", "age"), row.names = c(NA, -6L), class = "data.frame")
ds2 <- structure(list(id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("xx3", "xx7"), class = "factor"), 
                      delivery_date = structure(c(16253, 16292, 16376, 16253, 16406, 16415, 16256), class = "Date")), 
                 .Names = c("id", "delivery_date"), row.names = c(NA, -7L), class = "data.frame")
Jaap
  • 81,064
  • 34
  • 182
  • 193