
I am using multiple databases. For each of these databases I have created a key called matchcode, which is a combination of a country code and a year. Usually, when I merge these datasets, I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)
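
For reference, the key is essentially the two components pasted together, roughly like this:

dfA$matchcode <- paste0(dfA$iso, dfA$year)   # e.g. "NLD" + 2010 -> "NLD2010"
dfB$matchcode <- paste0(dfB$iso, dfB$year)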

The problem is that sometimes the years do not completely match:

dfA <- read.table(
  text = "A   B   C   D   E   F   G   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10   1   0   1   0   0   1   0  BEL   2010   BEL2010",
  header = TRUE
)

dfB <- read.table(
  text = "K   L   M   N   O   P   Q   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2011   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

I am looking for a data.table solution that does something similar to the following answer by Jaap:

library(data.table)
setDT(dfA)
setDT(dfB)

dfA[dfB
       , on = .(iso, year)
       , roll = "nearest"
       , .(ID, year.x = i.year, year.y = x.year, value, delta = abs(i.year - x.year))]

In addition to this solution, I would, however, like to:

  1. Have all columns of both databases in the new data.table.
  2. Have the roll limited to +1 | -1 (see the sketch after this list). However, when I enter this, it does not seem to be applied correctly.
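
For point 2, this is the kind of call I mean, a finite roll, which as far as I understand limits how far the join column may be rolled in one direction:

# roll =  1 rolls forwards (LOCF) by at most 1 year
# roll = -1 rolls backwards (NOCB) by at most 1 year
dfA[dfB
    , on = .(iso, year)
    , roll = 1
    , .(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]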

For 1., I apparently need to use mget (again thanks to Jaap):

dfA[dfB, on = .(iso, year), names(dfB)[1:10] := 
        mget(paste0("i.", names(dfB)[1:10]))]
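
To spell out what that refers to in this example:

cols <- names(dfB)[1:10]   # "K" ... "Q", "iso", "year", "matchcode"
paste0("i.", cols)         # dfB's columns (the i side of the join), as addressed inside dfA[dfB, ...]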

I cannot, however, seem to put this together successfully. I tried:

dfA[dfB, on = .(iso, year), roll = "nearest", names(dfB)[1:10] := 
   mget(paste0("i.", names(dfB)[1:10])),
   .(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]

But this gives:

Error in eval(bysub, xss, parent.frame()) : object 'i.year' not found.

What does work is the following:

 dfA[dfB
     , on = .(iso, year)
     , roll = "nearest"
     , .(matchcode, year.x = i.year, year.y = x.year, delta = abs(i.year - x.year))]
    matchcode year.x year.y delta
 1:   NLD2010   2009   2010     1
 2:   NLD2014   2014   2014     0
 3:   AUS2010   2011   2010     1
 4:   AUS2006   2007   2006     1
 5:   USA2008   2007   2008     1
 6:   USA2010   2011   2010     1
 7:   USA2012   2013   2012     1
 8:   BLG2008   2007   2008     1
 9:   BEL2008   2009   2008     1
10:   BEL2010   2012   2010     2

Any suggestions on how to proceed?


1 Answer


Hope this works for you:

# keep dfA's own year: after the join, the `year` column holds dfB's year
dfA[, yearA := year]

# two bounded rolling joins: forwards within 1 year, then backwards within 1 year
res1 <- dfA[dfB, on = .(iso, year), roll = 1, nomatch = 0]
res2 <- dfA[dfB, on = .(iso, year), roll = -1, nomatch = 0]

# stack both directions; exact matches are already in res1, so keep only yearA > year from res2
res <- rbind(res1, res2[yearA > year])
setnames(res, c('year', 'matchcode', 'i.matchcode'), c('yearB', 'matchcodeA', 'matchcodeB'))

#    A B C D E F G iso yearB matchcodeA yearA K L M N O P Q matchcodeB
# 1: 1 0 0 0 1 0 1 NLD  2014    NLD2014  2014 1 0 0 0 1 0 1    NLD2014
# 2: 0 0 0 1 1 0 0 AUS  2011    AUS2010  2010 0 0 0 1 1 0 0    AUS2011
# 3: 1 0 1 0 0 1 0 AUS  2007    AUS2006  2006 1 0 1 0 0 1 0    AUS2007
# 4: 0 0 1 0 0 0 1 USA  2011    USA2010  2010 0 0 1 0 0 0 1    USA2010
# 5: 0 1 0 1 0 0 0 USA  2013    USA2012  2012 0 1 0 1 0 0 0    USA2013
# 6: 0 1 0 1 1 0 1 BEL  2009    BEL2008  2008 0 1 0 1 1 0 1    BEL2009
# 7: 0 1 1 1 0 1 0 NLD  2009    NLD2010  2010 0 1 1 1 0 1 0    NLD2009
# 8: 0 1 0 1 0 1 1 USA  2007    USA2008  2008 0 1 0 1 0 1 1    USA2007
# 9: 0 1 0 1 0 0 0 USA  2011    USA2012  2012 0 0 1 0 0 0 1    USA2010
# 10: 1 0 1 0 0 1 0 BLG  2007    BLG2008  2008 1 0 1 0 0 1 0    BLG2007
# 11: 1 0 1 0 0 1 0 BEL  2009    BEL2010  2010 0 1 0 1 1 0 1    BEL2009
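
If you also want the delta column from your expected output, it can be added afterwards, e.g.:

res[, delta := abs(yearA - yearB)]   # gap in years between the matched rows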
  • Thanks mt1022! Very interesting solution! I do however get the error that yearA is not yet specified: `Error in .checkTypos(e, names(x)) : Object 'yearA' not found. Perhaps you intended year`. I thought you might have meant to add something like `colnames(res1)[9] <- "yearA"`, but that did not work haha.. – Tom Dec 26 '18 at 12:23
  • @Tom, sorry, I missed one line when copy-pasting from RStudio :(. Fixed now. – mt1022 Dec 26 '18 at 12:47
  • Thank you so much mt1022. One more thing: I notice that USA 2011 is now matched twice, once with 2010 and once with 2012. Is there any way to give preference to one or the other? I tried playing around with the `>` sign but that did not help. For the example at hand, one could simply remove the duplicates in `matchcodeA`; however, for my actual dataset this does not work (because `matchcodeA` is not meant to be unique there). – Tom Dec 26 '18 at 13:01
  • @Tom, a possible way is to add a new column to `dfA` with `dfA[, rowA := 1:.N]`. After joining, you can then make sure that each row of A is only used once by removing duplicated `rowA` values (see the sketch after these comments). – mt1022 Dec 26 '18 at 13:05
  • Thanks again! That should work because `dfA` should normally have a unique ID somewhere in the dataset. You have been a great help, thank you so much! – Tom Dec 26 '18 at 13:14
  • Hey mt1022, I have a small question. In my example I overlooked the fact that the combination of `year` and `iso` code is not unique in my actual dataset. As a result, when applying the solution I am losing many observations. Do you have any idea what to change to prevent this? – Tom Jan 04 '19 at 11:15
  • @Tom, I am not sure about your actual situation. Have you tried the suggestion in my last comment? – mt1022 Jan 04 '19 at 11:17
  • I did; for now I made it even simpler by just using `nearest`. I think the problem is that dfA contains individual observations, whereas dfB is country-level data, so there will, for example, be a thousand matches in dfA for each row in dfB. I guess I should make a new question about it, right? – Tom Jan 04 '19 at 11:23
  • Yes. Feel free to post a new question with a small example that addresses the issue of non-unique values. I would be happy to help if I have any thoughts. – mt1022 Jan 04 '19 at 11:31
  • https://stackoverflow.com/questions/54038311/doing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table – Tom Jan 04 '19 at 11:44
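
A minimal sketch of the de-duplication suggested in the comments above, assuming dfA and dfB are the data.tables from the question and the joins are done as in the answer (the first match, i.e. the roll = 1 direction, wins):

dfA[, `:=`(yearA = year, rowA = .I)]   # as in the answer, plus a unique row id per dfA row

res1 <- dfA[dfB, on = .(iso, year), roll = 1, nomatch = 0]
res2 <- dfA[dfB, on = .(iso, year), roll = -1, nomatch = 0]
res <- rbind(res1, res2[yearA > year])
res <- res[!duplicated(rowA)]          # keep each dfA row at most once
setnames(res, c('year', 'matchcode', 'i.matchcode'), c('yearB', 'matchcodeA', 'matchcodeB'))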