
I have a question about date manipulation in R. I've looked around for days but couldn't find any help online. I have one dataset with an id and two dates, and another dataset with the same id variable, a date and a price. For example:

x = data.frame(id = c("A","B","C","C"), 
               date1 = c("29/05/2013", "23/08/2011", "25/09/2011",  "18/11/2011"),    
               date2 = c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011") )
> x
  id      date1      date2
1  A 29/05/2013 10/07/2013
2  B 23/08/2011 04/10/2011
3  C 25/09/2011 10/11/2011
4  C 18/11/2011 15/12/2011

y = data.frame(id = c("A","A","A","B","B","B","B","B","B","C","C","C"),
              date = c("21/02/2013",  "19/06/2013", "31/07/2013", "07/10/2011", "16/01/2012", "10/07/2012","20/09/2012", "29/11/2012",  "15/08/2014", "27/09/2011", "27/01/2012", "09/03/2012"),
              price = c(126,109,111,14,13.8,14.1,14, 14.4,143,102,114,116))
> y
   id       date price
1   A 21/02/2013 126.0
2   A 19/06/2013 109.0
3   A 31/07/2013 111.0
4   B 07/10/2011  14.0
5   B 16/01/2012  13.8
6   B 10/07/2012  14.1
7   B 20/09/2012  14.0
8   B 29/11/2012  14.4
9   B 15/08/2014 143.0
10  C 27/09/2011 102.0
11  C 27/01/2012 114.0
12  C 09/03/2012 116.0

What I would like to do is take the two dates in dataset x and, if dataset y has a date for the same id that falls inside the interval they define, pick up that date and its price. If there is no such date, the values should be missing. So basically I want to end up with a final dataset like this:

final = data.frame(id = c("A","B","C","C"), 
                   date1 = c("29/05/2013", "23/08/2011", "25/09/2011",  "18/11/2011"),    
                   date2 = c("10/07/2013", "04/10/2011", "10/11/2011",  "15/12/2011"),
                   date = c("19/06/2013", NA, "27/09/2011", NA),
                   price = c(109, NA, 102, NA)  )

> final
  id      date1      date2       date price
1  A 29/05/2013 10/07/2013 19/06/2013   109
2  B 23/08/2011 04/10/2011 20/09/2012    14
3  C 25/09/2011 10/11/2011 27/09/2011   102
4  C 18/11/2011 15/12/2011         NA    NA

Any help will be much appreciated.

PetGous
  • see also: [roll join with start/end window](http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window) – MrFlick Jun 19 '15 at 19:54

3 Answers


Here is a solution based on the excellent foverlaps function from the data.table package.

library(data.table)
## convert the character columns to Date
setDT(x)[,c("date1","date2"):=list(as.Date(date1,"%d/%m/%Y"),
                                   as.Date(date2,"%d/%m/%Y"))]
## foverlaps needs a start and an end column, so copy the parsed date into both
setDT(y)[,c("date1"):=as.Date(date,"%d/%m/%Y")][,date:=date1]
## y must be keyed; the last two key columns are the interval start and end
setkey(y,id,date,date1)
foverlaps(x,y,by.x=c("id","date1","date2"))[,
            list(id,i.date1,date2,date,price)]

  id    i.date1      date2       date price
1:  A 2013-05-29 2013-07-10 2013-06-19   109
2:  B 2011-08-23 2011-10-04       <NA>    NA
3:  C 2011-09-25 2011-11-10 2011-09-27   102
4:  C 2011-11-18 2011-12-15       <NA>    NA

PS: the result is not exactly the same because you have an error in your expected output.
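
The mismatch for id B can be checked directly. Here is a minimal base R sketch that uses only the values from the question; the variable names (`fmt`, `b_start`, `b_end`, `b_dates`, `inside`) are made up for illustration:

```r
# Interval for id "B" in x, and every date recorded for "B" in y
fmt <- "%d/%m/%Y"
b_start <- as.Date("23/08/2011", format = fmt)
b_end   <- as.Date("04/10/2011", format = fmt)
b_dates <- as.Date(c("07/10/2011", "16/01/2012", "10/07/2012",
                     "20/09/2012", "29/11/2012", "15/08/2014"),
                   format = fmt)

# TRUE for each date that falls inside the closed interval [b_start, b_end]
inside <- b_dates >= b_start & b_dates <= b_end
any(inside)  # FALSE: the earliest B date, 07/10/2011, is already past 04/10/2011
```

So no row of y falls inside B's window, which is why the join returns NA there.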

agstudy

I'd take this in two steps. First, join the two data frames by id (see this link for more detail on joins), as follows:

df <- merge(x, y, by = "id")

Now you should have a full dataset, with even more entries than you asked for. To whittle it down by your criteria, try:

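library(dplyr)
# the comparison is only meaningful if date, date1 and date2 are Date objects,
# e.g. converted beforehand with as.Date(..., "%d/%m/%Y")
df <- filter(df, date > date1, date < date2)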

I believe that should work.

Edit: If you actually want NA values in there instead of dropping those rows, it gets a little more hairy. In that case, instead of the filter step, I would try this:

# rows whose date falls outside [date1, date2]; assumes all three columns are Date objects
outside <- df$date < df$date1 | df$date > df$date2
df$price[outside] <- NA
df$date[outside] <- NA
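
Blanking values this way still leaves one row per id/date combination from the merge. If one row per row of x is wanted, a base R sketch along the following lines might do it. It assumes the dates are parsed first, treats the interval as closed, and the names `fmt`, `m`, `inside` and `res` are illustrative only:

```r
# Data from the question, with the date columns parsed up front
fmt <- "%d/%m/%Y"
x <- data.frame(id = c("A", "B", "C", "C"),
                date1 = as.Date(c("29/05/2013", "23/08/2011", "25/09/2011", "18/11/2011"), format = fmt),
                date2 = as.Date(c("10/07/2013", "04/10/2011", "10/11/2011", "15/12/2011"), format = fmt))
y <- data.frame(id = c("A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C"),
                date = as.Date(c("21/02/2013", "19/06/2013", "31/07/2013", "07/10/2011",
                                 "16/01/2012", "10/07/2012", "20/09/2012", "29/11/2012",
                                 "15/08/2014", "27/09/2011", "27/01/2012", "09/03/2012"),
                               format = fmt),
                price = c(126, 109, 111, 14, 13.8, 14.1, 14, 14.4, 143, 102, 114, 116))

# Step 1: pair every interval in x with every dated price for the same id
m <- merge(x, y, by = "id")

# Step 2: keep only the pairs whose date lies inside [date1, date2]
inside <- m$date >= m$date1 & m$date <= m$date2

# Step 3: left-join back onto x so intervals without a match get NA
res <- merge(x, m[inside, ], by = c("id", "date1", "date2"), all.x = TRUE)
res
```

With the question's data this gives four rows, two of which (id B and the second C interval) have NA for date and price.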
ila

Or with lubridate and base R:

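library(lubridate)
# x and y as prepared in the Data section below (dates already parsed with dmy)
m <- merge(x, y, by='id')
d_range <- m$date1 %--% m$date2
m2 <- m[m$date %within% d_range, ]
res <- merge(x, m2, by=c('id', 'date1', 'date2'), all.x=TRUE)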

As @Isaac suggested, merging first helps make the process faster. The %--% operator from the lubridate package creates an interval from a start and an end date. The %within% operator tests whether the left-hand side falls inside the interval on the right-hand side.

  id      date1      date2       date price
1  A 2013-05-29 2013-07-10 2013-06-19   109
2  B 2011-08-23 2011-10-04       <NA>    NA
3  C 2011-09-25 2011-11-10 2011-09-27   102
4  C 2011-11-18 2011-12-15       <NA>    NA
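
The two lubridate operators can also be tried in isolation. This is a small sketch using the interval for id A from the question; `win` is just an illustrative name:

```r
library(lubridate)

# Interval for id "A": 29/05/2013 to 10/07/2013
win <- dmy("29/05/2013") %--% dmy("10/07/2013")

dmy("19/06/2013") %within% win  # TRUE: inside the window
dmy("31/07/2013") %within% win  # FALSE: after the window closes
```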

Data

x = data.frame(id = c("A","B","C","C"), 
               date1 = c("29/05/2013", "23/08/2011", "25/09/2011",  "18/11/2011"),    
               date2 = c("10/07/2013", "04/10/2011", "10/11/2011",  "15/12/2011"))

y = data.frame(id = c("A","A","A","B","B","B","B","B","B","C","C","C"),
              date = c("21/02/2013",  "19/06/2013",  "31/07/2013",  "07/10/2011",   "16/01/2012",   "10/07/2012","20/09/2012",  "29/11/2012",       "15/08/2014",   "27/09/2011",   "27/01/2012",   "09/03/2012"),
              price = c(126,109,111,14,13.8,14.1,14,    14.4,143,102,114,116))

x[c('date1', 'date2')] <- lapply(x[c('date1', 'date2')], dmy)
y['date'] <- dmy(y[,'date'])
Pierre L