
I am relatively new to R and absolutely new to Stack Overflow (though I have done a lot of research here), with some prior experience in Stata, Excel, VBA and a little C.

I have an R data frame df1 that looks like the following example, just with a few thousand rows:

ID       Date         Value  Class  ZIP 
TRA0001  2007-09-25   150    1      75019
TRA0002  2002-08-09   200    2      30152
TRA0003  2010-08-31   500    3      12451
TRA0004  2005-06-17   75     1      45242
TRA0005  2010-08-26   410    3      14618
TRA0006  2008-07-07   155    1      70139
TRA0007  2010-01-15   450    3      12883
TRA0008  2000-11-03   80     4      45242
TRA0009  2003-05-01   120    2      63017
TRA0010  2000-10-01   85     5      23712

Each row stands for one transaction. For each transaction, I need to find similar transactions based on the following combination of "matching criteria" (AND-connected):

  1. Date must be within +/- 18 months, e.g. for TRA0001 the only match would be TRA0006
  2. Value must be within +/- 20% of the original row's value, e.g. matches for TRA0001 would be TRA0006 and TRA0009
  3. Class must be an exact match, e.g. the matches for TRA0001 upon this criterion would be TRA0004 and TRA0006

Note that there can be no match, one match or multiple matches for each transaction/row. What I need in the end is a list of matches with respect to the combination of the three criteria above.
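To make the criteria concrete, here is a hedged sketch (not a solution, just the three conditions as a predicate; the names is_match, tra0001 and tra0006 are made up for illustration, and 548 days approximates 18 months via 1.5 * 365.25):

```r
# Sketch: does transaction b match transaction a under the three criteria?
is_match <- function(a, b) {
  days_apart <- abs(as.numeric(as.Date(b$Date) - as.Date(a$Date)))
  days_apart <= 548 &&                                        # 1. within +/- 18 months
    b$Value >= 0.8 * a$Value && b$Value <= 1.2 * a$Value &&   # 2. within +/- 20% of Value
    b$Class == a$Class                                        # 3. exact Class match
}

tra0001 <- list(Date = "2007-09-25", Value = 150, Class = 1)
tra0006 <- list(Date = "2008-07-07", Value = 155, Class = 1)
is_match(tra0001, tra0006)  # TRUE
```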

For the given example, a result df2 would look like this:

ID       ID_Match   ZIP_Match
TRA0001  TRA0006    70139
TRA0003  TRA0005    14618
TRA0003  TRA0007    12883
TRA0005  TRA0007    12883
TRA0006  TRA0001    75019
TRA0007  TRA0003    12451
TRA0007  TRA0005    14618

So far, I tried various combinations of duplicate search to get closer to my desired outcome by fulfilling at least one matching criterion and then "filtering down" that result according to the other constraints. I started with the Class condition, as this seemed to me the easiest criterion (and probably also the most selective). All I came up with in the end was, e.g., a list of all classes that have duplicates and their respective index positions where the duplicates can be found. For that I used the following code (found on Stack Overflow, credits to user "eddi"):

dups = duplicated(df1$Class) | duplicated(df1$Class, fromLast = T)
split(which(dups), df1$Class[dups])
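For reference, a self-contained sketch of what that snippet produces on the example data (using just the Class column): groups of row indices that share a Class, which still ignores the Date and Value conditions.

```r
# Class values of TRA0001 .. TRA0010 from the example data
Class <- c(1, 2, 3, 1, 3, 1, 3, 4, 2, 5)
# flag rows whose Class appears more than once
dups <- duplicated(Class) | duplicated(Class, fromLast = TRUE)
# group the flagged row indices by their Class value
groups <- split(which(dups), Class[dups])
groups
# $`1`: rows 1, 4, 6   $`2`: rows 2, 9   $`3`: rows 3, 5, 7
```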

However, this still leaves me miles away from my desired result and I have no idea how to "integrate" the other conditions. I hope I have provided all the necessary information and made my problem clear. Any hints, suggestions or solutions are more than welcome! Thanks in advance!

Additionally: if someone comes up with an idea how to do the required work in Stata, this would also be welcome - I have slightly more knowledge of Stata than of R.

Lighty
  • How does TRA0001 matching TRA0004 and TRA0006 count as "exact"? Why not e.g. TRA002? – Nick Cox Jun 05 '16 at 14:18
  • @Nick Cox assuming you are talking about the **Class**-condition: Only TRA0004 and TRA0006 have exactly the same value for Class (1) as TRA0001. Instead TRA0002 has another Class value assigned (2) -> No "exact" match to TRA0001 – Lighty Jun 05 '16 at 14:55

3 Answers


I think I found a way you can do it. Basically, we define a function that does what you want for one ID, then use sapply to iterate through all the IDs, then use a call to rbind to put the results together.

The number-of-months function comes from @Dirk, in this post.

df <- read.table(text = 
          "ID       Date         Value  Class  ZIP 
           TRA0001  2007-09-25   150    1      75019
           TRA0002  2002-08-09   200    2      30152
           TRA0003  2010-08-31   500    3      12451
           TRA0004  2005-06-17   75     1      45242
           TRA0005  2010-08-26   410    3      14618
           TRA0006  2008-07-07   155    1      70139
           TRA0007  2010-01-15   450    3      12883
           TRA0008  2000-11-03   80     4      45242
           TRA0009  2003-05-01   120    2      63017
           TRA0010  2000-10-01   85     5      23712", 
           header = T)
# turn a date into a 'monthnumber' relative to an origin
monnb <- function(d) { 
      lt <- as.POSIXlt(as.Date(d, origin="1900-01-01"))
      lt$year*12 + lt$mon
    } 

# compute a month difference as a difference between two monnb's
mondf <- function(d1, d2) { monnb(d2) - monnb(d1) }

find_fxn <- function(data, origID){
  #create subset with ID of interest
  orig_data <- subset(data, ID == origID)
  #subset of all other IDs
  other_data <- subset(data, ID != origID)
  #three matching criteria
  find_first <- which(abs(mondf(orig_data$Date, other_data$Date)) <= 18)
  find_second <- which(other_data$Value >= 0.8 * orig_data$Value & other_data$Value <= 1.2 * orig_data$Value)
  find_third <- which(other_data$Class == orig_data$Class)
  #use intersect to remove dups
  find_all <- intersect(intersect(find_first, find_second), find_third)
  if(length(find_all) > 0){
  cbind.data.frame(ID = orig_data$ID, 
                   IDMatch = other_data[find_all, 1],
                   ZipMatch = other_data[find_all, 5])
  }
}

do.call('rbind', sapply(df$ID, FUN = function(x) find_fxn(data = df, origID = x)))

       ID IDMatch ZipMatch
1 TRA0001 TRA0006    70139
2 TRA0003 TRA0005    14618
3 TRA0003 TRA0007    12883
4 TRA0005 TRA0007    12883
5 TRA0006 TRA0001    75019
6 TRA0007 TRA0003    12451
7 TRA0007 TRA0005    14618
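For what it's worth, the final line is less magic than it looks: sapply here returns a list of per-ID data frames (they differ in size, so they cannot be simplified to a matrix), and do.call('rbind', parts) is just rbind(parts[[1]], parts[[2]], ...). A minimal standalone illustration (the IDs "A" and "B" are made up):

```r
# toy stand-in for the per-ID results: a list of small data frames
parts <- list(data.frame(ID = "A", IDMatch = "B"),
              data.frame(ID = "B", IDMatch = "A"))
# stack them row-wise, same as rbind(parts[[1]], parts[[2]])
stacked <- do.call('rbind', parts)
stacked
```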
bouncyball
  • Thank you so much for your quick and profound reply. This absolutely did the trick! I tried to not just copy & paste but rather to understand the algorithm. Your code is not just effective but also very easy to understand for a newbie! I could follow everything (until the rbind/sapply thing, that still seems a bit like magic to me...). Going to have to validate my results, but I'm really grateful for your answer! Again, thanks a lot! – Lighty Jun 04 '16 at 18:05
  • @Lighty no problem! Thanks for writing such a thorough problem description, it made it quite easy to understand what you were trying to do. Your write-up serves as an example of really great SO usage! :-) – bouncyball Jun 04 '16 at 20:10
  • bouncyball, just to make sure: I wanted to integrate the time-filtering solution of @Abderyt into your code. Therefore, I replaced your `find_first` line of code with this: `find_first <- which(abs(difftime(as.POSIXct(orig_data$Date), as.POSIXct(other_data$Date), units='days')) <= 1.5 * 365)` Does this sound reasonable? I was wondering if I get a more precise selection by this, as your solution is "only" addressing full months, which means the filter window can be nearly one month too wide (e.g. an actual transaction on 2010-12-31 would include all transactions from 2008-06-01 on) – Lighty Jun 05 '16 at 16:36
  • @Lighty yes, I think that sounds reasonable. happy coding! – bouncyball Jun 06 '16 at 13:01

There's a new user-written program called rangejoin (from SSC) that can be used to easily solve this problem in Stata. In order to use rangejoin, you also have to install rangestat (also from SSC). To install both, type in Stata's command window:

ssc install rangestat
ssc install rangejoin

rangejoin forms all pairwise combinations of observations that fall within a specified range. Since you want to match observations that have the same Class value, the join can be performed within Class groups. Since you have daily dates, I set up the solution to use a window of +/- 548 days (based on 365.25 days a year). Once all pairwise combinations are formed (within the specified time window for each observation), you can drop those that do not match your 20% threshold for Value.

Here's a fully functional example that uses your posted data:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 ID str10 Date int Value byte Class str5 ZIP
"TRA0001" "2007-09-25" 150 1 "75019"
"TRA0002" "2002-08-09" 200 2 "30152"
"TRA0003" "2010-08-31" 500 3 "12451"
"TRA0004" "2005-06-17"  75 1 "45242"
"TRA0005" "2010-08-26" 410 3 "14618"
"TRA0006" "2008-07-07" 155 1 "70139"
"TRA0007" "2010-01-15" 450 3 "12883"
"TRA0008" "2000-11-03"  80 4 "45242"
"TRA0009" "2003-05-01" 120 2 "63017"
"TRA0010" "2000-10-01"  85 5 "23712"
end

* convert string date to Stata numeric date
gen ndate = daily(Date, "YMD")
format %td ndate

* save a copy to disk
save "using_copy.dta", replace

* match, within the same Class, obs +/- 18 months (365.25 * 1.5 =~ 548 days)
rangejoin ndate -548 548 using "using_copy.dta", by(Class) suffix(_Match)

* drop matches where the amount is off by more than 20%, and self-matches
drop if (abs(Value - Value_Match) / Value) > .2
drop if ID == ID_Match

* final results
sort ID ID_Match
list ID ID_Match ZIP_Match, sepby(ID) noobs

And the results:

. list ID ID_Match ZIP_Match, sepby(ID) noobs

  +-------------------------------+
  |      ID   ID_Match   ZIP_Ma~h |
  |-------------------------------|
  | TRA0001    TRA0006      70139 |
  |-------------------------------|
  | TRA0003    TRA0005      14618 |
  | TRA0003    TRA0007      12883 |
  |-------------------------------|
  | TRA0005    TRA0007      12883 |
  |-------------------------------|
  | TRA0006    TRA0001      75019 |
  |-------------------------------|
  | TRA0007    TRA0003      12451 |
  | TRA0007    TRA0005      14618 |
  +-------------------------------+
Robert Picard
  • Thank you very much! Another very convenient and understandable solution. Again, thanks for providing this solution in Stata. – Lighty Jun 05 '16 at 15:39

First of all, use the data.table package.

Then you can write a simple function that looks for all transactions similar to the provided one.

In the end, loop over your dataset to get all similar sets:

dt1 <- data.table::fread('ID       Date         Value  Class  ZIP 
TRA0001  2007-09-25   150    1      75019
TRA0002  2002-08-09   200    2      30152
TRA0003  2010-08-31   500    3      12451
TRA0004  2005-06-17   75     1      45242
TRA0005  2010-08-26   410    3      14618
TRA0006  2008-07-07   155    1      70139
TRA0007  2010-01-15   450    3      12883
TRA0008  2000-11-03   80     4      45242
TRA0009  2003-05-01   120    2      63017
TRA0010  2000-10-01   85     5      23712')

dt1[, Date:=as.POSIXct(Date)]
myTransaction <- dt1[1]
dt1[Class==myTransaction$Class & abs(difftime(Date, myTransaction$Date, units='weeks')) < 4*18 & abs((Value-myTransaction$Value)/myTransaction$Value) < .2]

similar <- lapply(1:nrow(dt1), function(x)
  {
  myTransaction <- dt1[x]
  dt1[ID!=myTransaction$ID & Class==myTransaction$Class & abs(difftime(Date, myTransaction$Date, units='weeks')) < 4*18 & abs((Value-myTransaction$Value)/myTransaction$Value) < .2]
})
names(similar) <- dt1$ID

Use similar[['TRA0006']] to check for similar transactions.
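A possible follow-up (a sketch with made-up entries, not this answer's actual output): since every ID gets a list entry even when nothing matches, Filter(nrow, ...) can drop the empty ones.

```r
# toy stand-in for the result list: TRA0001 has one match, TRA0002 has none
sim <- list(TRA0001 = data.frame(ID_Match = "TRA0006"),
            TRA0002 = data.frame(ID_Match = character(0)))
# keep only entries with at least one matching row (nrow 0 coerces to FALSE)
with_matches <- Filter(nrow, sim)
names(with_matches)  # "TRA0001"
```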

Abderyt
  • Thank you for your reply, appreciate it! I first tried the solution of @bouncyball and it worked quite fine for me. However, I also will have a look at your suggested way by using the data.table package. (However, this seems to be a quite impressive and massive package and not too easy to get into it... Just my impression from the few minutes research I did so far) – Lighty Jun 04 '16 at 18:15
  • If you are dealing with huge datasets (millions of rows) you should consider using it. – Abderyt Jun 04 '16 at 23:19
  • thanks again for the reply. Had the time to check your solution again, however didn't really manage to come up with the desired result. If I got it right, you loop the function that finds the similar transactions over the whole data.table and store that in a vector that should indicate all similar transactions to each row, correct? However, it only provides me vector in the length of the data.table but for each entry with the value "NULL". I found one minor typo in your solution (pom$Value, which should be myTransaction$Value instead, I believe), but still didn't work. – Lighty Jun 05 '16 at 13:53
  • I guess I go for @bouncyball's solution as for my simple mind this is a bit more understandable ;) However, I like your way of checking for the three conditions, especially the date-filtering in general. If I am right, your way of setting the date limits is somewhat imprecise as you approximate every month by 4 weeks. Over 1.5 years, this leads to a difference of 6 weeks (calculated with 48 weeks/year instead of ~52 real weeks/year). Guess I implement your version however and use the unit "days" instead with the limit 1.5*365 -> should also be a bit more precise than the solution of bouncyball – Lighty Jun 05 '16 at 14:01