I am relatively new to R and absolutely new to stackoverflow (having researched a lot here anyway as I have some prior experience in Stata, Excel, VBA and little C).
I have a R dataframe df1 that looks like the following example, just with a few thousand rows:
ID Date Value Class ZIP
TRA0001 2007-09-25 150 1 75019
TRA0002 2002-08-09 200 2 30152
TRA0003 2010-08-31 500 3 12451
TRA0004 2005-06-17 75 1 45242
TRA0005 2010-08-26 410 3 14618
TRA0006 2008-07-07 155 1 70139
TRA0007 2010-01-15 450 3 12883
TRA0008 2000-11-03 80 4 45242
TRA0009 2003-05-01 120 2 63017
TRA0010 2000-10-01 85 5 23712
Each row stands for one transaction. What I need to find are similar transactions to each transaction based on the following combination of "matching criteria" (AND connected):
- Date must be within +/- 18 months, e.g. for TRA0001 the only match would be TRA 0006
- Value must be within +/- 20% of the original row's value, e.g. matches for TRA0001 would be TRA0006 and TRA0009
- Class must be an exact match, e.g. the matches for TRA0001 upon this criterion would be TRA0004 and TRA0006
Note that there can be no match, one match or multiple matches for each transaction/ row. What I need in the end is a list of matches in respect of the combination of the three criteria mentioned above.
For the given example, a result df2 would look like this:
ID ID_Match ZIP_Match
TRA0001 TRA0006 70139
TRA0003 TRA0005 14618
TRA0003 TRA0007 12883
TRA0005 TRA0007 12883
TRA0006 TRA0001 75019
TRA0007 TRA0003 12451
TRA0007 TRA0005 14618
So far, I tried various combinations of duplicate search to get closer to my desired outcome by fulfilling at least one matching criteria and next "filtering down" this result according to the other constraints. I started with the Class condition, as this seemed to me to be the easiest criterion (and probably also the most selective). All I came up in the end was e.g a list of all classes that have duplicates and there respective index positions where the duplicates can be found. For that I used the following code (found on stackoverflow, credits to user "eddi"):
dups = duplicated(df1$Class) | duplicated(d1$Class, fromLast = T)
split(which(dups), df1$Class[dups])
However, this still leaves me miles away from my desired result and I have no idea how to "integrate" the other conditions. Hope I could provide all the necessary information and could make clear my problem. Any kind of hints, suggestions or solutions is more than welcome! Thanks in advance!
Additionally: If someone comes up with an idea how to do the required work with Stata, this would also be welcome - I have slightly slightly more knowledge on Stata than on R.