0

I am trying to perform a (simplified!) query on colour and date differences on the following two databases (extract):

A                           B   
    A.COL   A.TIME              B.COL   B.TIME
1   blue    2009-01-31      1   blue    2007-01-31
2   blue    2009-02-28      2   blue    2008-12-31
3   blue    2009-03-31      3   blue    2009-02-28
4   blue    2009-04-30      4   blue    2009-04-30
5   blue    2009-05-31      5   blue    2009-06-30
6   blue    2009-06-30      6   blue    2016-08-31
7   blue    2016-03-31
8   blue    2016-04-30
9   red ...
10  red ...

What I want to do: Merge the tables based on COL as well as the difference in TIME, that is the difference betweeen both times must not be larger or lower than 2 month (or in other words lie between -2 and +2, depending from which date one starts with).

# For example starting with observation 1 from A, that would imply 2 matches:
2009-01-31 matched to 2008-12-31 (diff = 1)
2009-01-31 matched to 2009-02-28  (diff = -1)

# for obs 2 from A, that would imply 
2009-02-28 matched to 2008-12-31 (diff = 2)
2009-02-28 matched to 2009-02-28 (diff = 0)
2009-02-28 matched to 2009-04-30 (diff = -2)

etc. I was thinking about some kind of Date difference function, either from lubridate, which was problematic in cases of months with less than 30 days and sometimes made NAs, or with as.yearmon from zoo, which worked to correctly calculate the differences at least. However, I wasn't able to implement this into sqldf properly (Error: error in statement: near "as": syntax error). The reason seems to be that one cannot use every R function withing sqldf. Any ideas how it can be done in R? I was also looking for an elegant way of how to subtract months from each other. There is this issue with lubridate: Add/subtract 6 months (bond time) in R using lubridate, but here was one proposed way how to accomplish it with zoo: Get the difference between dates in terms of weeks, months, quarters, and years

Get the data (thanks @bouncyball below for the code):

A <- read.table(
  text = "
  A.COL   A.TIME          
  blue    2009-01-31     
  blue    2009-02-28      
  blue    2009-03-31      
  blue    2009-04-30      
  blue    2009-05-31      
  blue    2009-06-30
  blue    2016-03-31
  blue    2016-04-30
  ", header = T, stringsAsFactors = FALSE)


B <- read.table(
  text = "
  B.COL   B.TIME
  blue    2007-01-31
  blue    2008-12-31
  blue    2009-02-28
  blue    2009-04-30
  blue    2009-06-30
  blue    2016-08-31
  ", stringsAsFactors = FALSE, header = T)
Community
  • 1
  • 1
user3032689
  • 627
  • 1
  • 10
  • 23

1 Answers1

1

Here's a solution which uses functions from this SO post and the plyr package:

library(plyr)

# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) { 
  lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
  lt$year*12 + lt$mon 
  } 

# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }

# iterate over rows of A looking for matches in B
adply(A, 1, function(x)
  B[x$A.COL == B$B.COL & 
      abs(mondf(as.Date(x$A.TIME), as.Date(B$B.TIME))) <= 2,]
)

#     A.COL    A.TIME  B.COL    B.TIME
# 1   blue 2009-01-31  blue 2008-12-31
# 2   blue 2009-01-31  blue 2009-02-28
# 3   blue 2009-02-28  blue 2008-12-31
# 4   blue 2009-02-28  blue 2009-02-28
# 5   blue 2009-02-28  blue 2009-04-30
#  ....

edit: data.table implementation

library(data.table)
merge_AB <- data.table(merge(A,B, by.x = 'A.COL', by.y = 'B.COL'))

merge_AB[,DateDiff := abs(mondf(A.TIME, B.TIME))
       ][DateDiff <= 2]

 #     A.COL     A.TIME     B.TIME DateDiff
 # 1:  blue 2009-01-31 2008-12-31        1
 # 2:  blue 2009-01-31 2009-02-28        1
 # 3:  blue 2009-02-28 2008-12-31        2
 # 4:  blue 2009-02-28 2009-02-28        0
 # 5:  blue 2009-02-28 2009-04-30        2
 # ...

data

A <- read.table(
text = "
A.COL   A.TIME          
blue    2009-01-31     
blue    2009-02-28      
blue    2009-03-31      
blue    2009-04-30      
blue    2009-05-31      
blue    2009-06-30
blue    2016-03-31
blue    2016-04-30
", header = T, stringsAsFactors = FALSE)


B <- read.table(
  text = "
B.COL   B.TIME
blue    2007-01-31
blue    2008-12-31
blue    2009-02-28
blue    2009-04-30
blue    2009-06-30
blue    2016-08-31
", stringsAsFactors = FALSE, header = T)
Community
  • 1
  • 1
bouncyball
  • 10,631
  • 19
  • 31
  • Good job! Only one question: Can this also be done with `datatable`? I am not familiar with `plyr` and I just started with `datatable`, so I wonder if it can be used here to achieve the same sorting? At a sidenote, you did not use `sqldf` to solve this. I wonder: Is not possible in this case? – user3032689 Sep 07 '16 at 09:16
  • @user3032689 Yes we can, see my edit for a `data.table` implementation – bouncyball Sep 07 '16 at 17:44