3

I have two data tables which have partly similar column names:

   dfA <- read.table(
  text = "A   B   C   D   E   F   G   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2010   NLD2010
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2010   AUS2010
  4   1   0   1   0   0   1   0   AUS   2006   AUS2006
  5   0   1   0   1   0   1   1   USA   2008   USA2008
  6   0   0   1   0   0   0   1   USA   2010   USA2010
  7   0   1   0   1   0   0   0   USA   2012   USA2012
  8   1   0   1   0   0   1   0   BLG   2008   BLG2008
  9   0   1   0   1   1   0   1   BEL   2008   BEL2008
  10   1   0   1   0   0   1   0  BEL   2010   BEL2010",
  header = TRUE
)

   dfB <- read.table(
  text = "A   B   C   D   H   I   J   iso   year   matchcode
  1   0   1   1   1   0   1   0   NLD   2009   NLD2009
  2   1   0   0   0   1   0   1   NLD   2014   NLD2014
  3   0   0   0   1   1   0   0   AUS   2011   AUS2011
  4   1   0   1   0   0   1   0   AUS   2007   AUS2007
  5   0   1   0   1   0   1   1   USA   2007   USA2007
  6   0   0   1   0   0   0   1   USA   2011   USA2010
  7   0   1   0   1   0   0   0   USA   2013   USA2013
  8   1   0   1   0   0   1   0   BLG   2007   BLG2007
  9   0   1   0   1   1   0   1   BEL   2009   BEL2009
  10   1   0   1   0   0   1   0  BEL   2012   BEL2012",
  header = TRUE
)
library(data.table)
setDT(dfA)
setDT(dfB)

To merge the data.tables I will do the following:

dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]

This will however, apart from the desired duplicate column matchcode also create the undesired duplicate columns A, B, C, D. Because of the number of merges I need to do, that would get too messy.

Is there a way to exclude duplicate columns from the merging process without explicitly referring to them? If not, how can I do so by explicitly referring to them. If not, can I remove them afterwards without explicitly referring to the duplicates? For example by removing all columns which look like `i.columnname' ?

The preferred output would be as follows:

#    A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD  2014  NLD2014  1 0 1    NLD2014
# 2: 0 0 0 1 1 0 0 AUS  2011  AUS2010  1 0 0    AUS2011
# 3: 1 0 1 0 0 1 0 AUS  2007  AUS2006  0 1 0    AUS2007
# 4: 0 0 1 0 0 0 1 USA  2011  USA2010  0 0 1    USA2010
# 5: 0 1 0 1 0 0 0 USA  2013  USA2012  0 0 0    USA2013
# 6: 0 1 0 1 1 0 1 BEL  2009  BEL2008  1 0 1    BEL2009
# 7: 0 1 1 1 0 1 0 NLD  2009  NLD2010  0 1 0    NLD2009
# 8: 0 1 0 1 0 1 1 USA  2007  USA2008  0 1 1    USA2007
# 9: 0 1 0 1 0 0 0 USA  2011  USA2012  0 0 1    USA2010
#10: 1 0 1 0 0 1 0 BEL  2009  BEL2010  1 0 1    BEL2009
M--
  • 25,431
  • 8
  • 61
  • 93
Tom
  • 2,173
  • 1
  • 17
  • 44
  • My apologies for the mistake, I was really quickly making the table by hand, as my R is running huge imputations now so I cannot use it for hours to come. If the matchcode is there that should be enough! – Tom Jan 01 '19 at 05:36
  • It's okay. I thought it was a typo – akrun Jan 01 '19 at 05:37

1 Answers1

1

We can create an index of column names that are common with intersecgt

nm1 <- intersect(names(dfA), names(dfB))

then, use setdiff to find the column names that are found in 'dfB' and not in the 'nm1' while including the joining columns 'iso' 'year' as well as the 'matchcode'

nm2 <- c(setdiff(names(dfB), nm1), "iso", "year", "matchcode")

Now, we do the join

out <- dfA[dfB[, ..nm2], on = .(iso, year), roll = "nearest", nomatch = 0]
setnames(out, c('matchcode', 'i.matchcode'), c('matchcodeA', 'matchcodeB'))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • This is great akrun, thank you so much! I guess I could just manually remove the matchcode column from the index of column names to keep that information right? – Tom Jan 01 '19 at 05:32
  • @Tom I thought you didn't need the `matchcode` as well. I edited it to include that column as well – akrun Jan 01 '19 at 05:33
  • 1
    Thanks you so much! I just realised after asking my question that I might want to keep that information. – Tom Jan 01 '19 at 05:34
  • Hi akrun, I have a small question. I have been trying to apply your solution (and also my own attempt from this post) to a few databases. I have noticed that it greatly reduces the number of observations. Apparently it does not completely do what I thought it would (only adding information to other rows). I cannot really figure out why.. Would you have any idea? – Tom Jan 04 '19 at 10:53
  • Is it possible that it does not work if `iso` and `year` are not unique in the dataset? I started out by simply doing: `dfA<- merge(dfA, dfB, by= "matchcode", all.x = TRUE, allow.cartesian=FALSE)` – Tom Jan 04 '19 at 11:10
  • @Tom It should be unique in at least one of the dataset. Otherwise, the matching becomes non-specific. Please check whether you have duplicate entries for the column with `table(dfA$matchcode)`, `table(dfB$matchcode)` – akrun Jan 04 '19 at 18:43
  • Hey akrun, thanks for your response. It is unique for `dfB`. I posted a new question about it: https://stackoverflow.com/questions/54038311/doing-a-fuzzy-and-non-fuzzy-many-to-1-merge-with-data-table/54039174#54039174 . Someone posted a solution but it leaves me with empty entries in the matched columns somehow. – Tom Jan 05 '19 at 01:59