Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table

Question

Lets assume I have two databases dfA and dfB. One has individual observations and one has country level data (which is applicable to multiple observations which are from the same year and country) For each of these databases I have created a key called matchcode. This matchcode is a combination of a country code and a year.

   dfA <- read.table(
  text = "A   B   C   D   E   F   G   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10  1   0   1   0   0   1   0   BEL   2010   BEL2010
  11  0   1   1   1   0   1   0   NLD   2010   NLD2010
  12  1   0   0   0   1   0   1   NLD   2014   NLD2014
  13  0   0   0   1   1   0   0   AUS   2010   AUS2010
  14  1   0   1   0   0   1   0   AUS   2006   AUS2006
  15  0   1   0   1   0   1   1   USA   2008   USA2008
  16  0   0   1   0   0   0   1   USA   2010   USA2010
  17  0   1   0   1   0   0   0   USA   2012   USA2012
  18  1   0   1   0   0   1   0   BLG   2008   BLG2008
  19  0   1   0   1   1   0   1   BEL   2008   BEL2008
  20  1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)

   dfB <- read.table(
  text = "A   B   C   D   H   I   J   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2011   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

library(data.table)
setDT(dfA)
setDT(dfB)

Mostly when I merge these datasets I simply do:

dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)

The problem is that sometimes the years do not completely match. So I tried:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

But this reduces the amount of observations to 11.

# A tibble: 11 x 18
       A     B     C     D     E     F     G iso    year matchcode     K     L     M     N     O     P     Q i.matchcode
   <int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct>     <int> <int> <int> <int> <int> <int> <int> <fct>      
 1     0     1     1     1     0     1     0 NLD    2009 NLD2010       0     1     1     1     0     1     0 NLD2009    
 2     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 3     1     0     0     0     1     0     1 NLD    2014 NLD2014       1     0     0     0     1     0     1 NLD2014    
 4     0     0     0     1     1     0     0 AUS    2011 AUS2010       0     0     0     1     1     0     0 AUS2011    
 5     1     0     1     0     0     1     0 AUS    2007 AUS2006       1     0     1     0     0     1     0 AUS2007    
 6     0     1     0     1     0     1     1 USA    2007 USA2008       0     1     0     1     0     1     1 USA2007    
 7     0     0     1     0     0     0     1 USA    2011 USA2010       0     0     1     0     0     0     1 USA2010    
 8     0     1     0     1     0     0     0 USA    2013 USA2012       0     1     0     1     0     0     0 USA2013    
 9     1     0     1     0     0     1     0 BLG    2007 BLG2008       1     0     1     0     0     1     0 BLG2007    
10     0     1     0     1     1     0     1 BEL    2009 BEL2008       0     1     0     1     1     0     1 BEL2009    
11     1     0     1     0     0     1     0 BEL    2012 BEL2010       1     0     1     0     0     1     0 BEL2012

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009
#11: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
#12: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
#13: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
#14: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
#15: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
#16: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
#17: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
#18: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
#19: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#20: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009

Additional Sources:

1. The previous try

2. The try before that

It seems that you want to match each row from dfA in dfB. Does this give you the desired output: `dfB[dfA, on = .(iso, year), roll = "nearest", nomatch = 0]`? — mt1022, Jan 04 '19 at 11:57
Thanks for taking a look mt1022! It does for the example, but I am still losing about 14.000 observations in the actual dataset regrettably.. — Tom, Jan 04 '19 at 12:08
I guess some code in dfA are not present in B. You can set `nomatch = NA` in join and examine what happens to those rows that get NA values. — mt1022, Jan 04 '19 at 12:12

Wimpel · Accepted Answer · 2019-01-05T06:53:33.917

Hers is my (default) approach for a join like this, using data.table

code

library( data.table )

#change the name of the matchcode-column
setnames(dfA, c("matchcode", "iso", "year"), c("matchcodeA", "isoA", "yearA"))
setnames(dfB, c("matchcode", "iso", "year"), c("matchcodeB", "isoB", "yearB"))

#store column-order for in the end
namesA <- as.character( names( dfA ) )
namesB <- as.character( setdiff( names(dfB), names(dfA) ) )
colorder <- c(namesA, namesB)

#create columns to join on
dfA[, `:=`(iso.join = isoA, year.join = yearA)]
dfB[, `:=`(iso.join = isoB, year.join = yearB)]

#perform left join
result <- dfB[dfA, on = c("iso.join", "year.join"),roll = "nearest" ]

#drop columns that are not needed
result[, grep("^i\\.", names(result)) := NULL ]
result[, grep("join$", names(result)) := NULL ]

#set column order
setcolorder(result, colorder)

result

#     A B C D E F G isoA yearA matchcodeA H I J isoB yearB matchcodeB
#  1: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
#  2: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
#  3: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
#  4: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
#  5: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
#  6: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
#  7: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
#  8: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
#  9: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 10: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009
# 11: 0 1 1 1 0 1 0  NLD  2010    NLD2010 0 1 0  NLD  2009    NLD2009
# 12: 1 0 0 0 1 0 1  NLD  2014    NLD2014 1 0 1  NLD  2014    NLD2014
# 13: 0 0 0 1 1 0 0  AUS  2010    AUS2010 1 0 0  AUS  2011    AUS2011
# 14: 1 0 1 0 0 1 0  AUS  2006    AUS2006 0 1 0  AUS  2007    AUS2007
# 15: 0 1 0 1 0 1 1  USA  2008    USA2008 0 1 1  USA  2007    USA2007
# 16: 0 0 1 0 0 0 1  USA  2010    USA2010 0 0 1  USA  2011    USA2010
# 17: 0 0 1 0 0 0 0  USA  2012    USA2012 0 0 1  USA  2011    USA2010
# 18: 1 0 1 0 0 1 0  BLG  2008    BLG2008 0 1 0  BLG  2007    BLG2007
# 19: 0 1 0 1 1 0 1  BEL  2008    BEL2008 1 0 1  BEL  2009    BEL2009
# 20: 0 1 0 1 0 1 0  BEL  2010    BEL2010 1 0 1  BEL  2009    BEL2009

sample data

dfA <- fread(
  "A   B   C   D   E   F   G   iso   year   matchcode
  0   1   1   1   0   1   0   NLD   2010   NLD2010
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2010   AUS2010
     1   0   1   0   0   1   0   AUS   2006   AUS2006
     0   1   0   1   0   1   1   USA   2008   USA2008
     0   0   1   0   0   0   1   USA   2010   USA2010
     0   1   0   1   0   0   0   USA   2012   USA2012
     1   0   1   0   0   1   0   BLG   2008   BLG2008
     0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010
    0   1   1   1   0   1   0   NLD   2010   NLD2010
    1   0   0   0   1   0   1   NLD   2014   NLD2014
    0   0   0   1   1   0   0   AUS   2010   AUS2010
    1   0   1   0   0   1   0   AUS   2006   AUS2006
    0   1   0   1   0   1   1   USA   2008   USA2008
    0   0   1   0   0   0   1   USA   2010   USA2010
    0   1   0   1   0   0   0   USA   2012   USA2012
    1   0   1   0   0   1   0   BLG   2008   BLG2008
    0   1   0   1   1   0   1   BEL   2008   BEL2008
    1   0   1   0   0   1   0   BEL   2010   BEL2010",
  header = TRUE
)


dfB <- fread(
  "A   B   C   D   H   I   J   iso   year   matchcode
     0   1   1   1   0   1   0   NLD   2009   NLD2009
     1   0   0   0   1   0   1   NLD   2014   NLD2014
     0   0   0   1   1   0   0   AUS   2011   AUS2011
     1   0   1   0   0   1   0   AUS   2007   AUS2007
     0   1   0   1   0   1   1   USA   2007   USA2007
     0   0   1   0   0   0   1   USA   2011   USA2010
     0   1   0   1   0   0   0   USA   2013   USA2013
     1   0   1   0   0   1   0   BLG   2007   BLG2007
     0   1   0   1   1   0   1   BEL   2009   BEL2009
     1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)

I removed a previous column because I thought something was wrong. On a more thorough inspection it did however work just fine. Thank you so much for your help. A very small question. I noticed that apart from `matchcodeB`, also `year` and `iso`a are empty (this is my confusion arose). Is it possible to a adapt the code to leave `iso` and `year` from `dfA` intact? — Tom, Jan 05 '19 at 02:32
In the hope you are still here: I have been using your answer to merge many databases. For example A with number 1-8 and then B with number 1 to 8. The first merge goes well but after I am getting problems with duplicates in the `colorder`. I'm breaking my head on solving it. Would you have any idea why it happens and how to prevent it? Should I also remove the `i.` and `.join` from the other databases? — Tom, Jan 05 '19 at 07:13
Nevermind, I fixed it by dropping all variables which are not needed from each `dfB` as well before doing the second merging. That appears to work. — Tom, Jan 05 '19 at 07:29

Doing a "fuzzy" and non-fuzzy, many to 1 merge with data.table

1 Answers1

Linked