Merge two data sets with indicator variable on Non-merges

Question

I have two data sets. Lets call them Master data and Using data. I merged the using data into the master data using 3 keys. I included the all=TRUE argument to keep all merged and non-merged observations

merged_data <- merge(dataset1, dataset2, by=c("key1","key2","key3"), all=T)

Now my question. How do I get information on whether a non-merge happened in Master data or the Using data? Ideally, I want a column that has value

1) 1 if non merge happened due to Using data 2) 2 if non merge happened due to Master Data 3) 3 if Successful Merge!!

Thanks, Kedar

Hi Kedar! Please define how a non-merge happens in one dataset and not the other? If there is no match in one of the data frames, by definition there shouldn't be a match in the other data frame. Perhaps provide example data and the expected output. Perhaps also read here: https://stackoverflow.com/a/5963610/1842673 — TobiO, Jan 13 '20 at 08:17

Rohit · Answer 1 · 2020-01-13T10:36:17.030

The procedure is the same for any number of keys. You need a column in each table that has no NAs(missing values). (Edit: If no such column exists in one of the tables, the you can create a dummy column like so: dfa$dummy1 <-1)

After the merge, these columns will contain NAs if that row doesn't have a match in the other table. Then you can just search for these rows in the merged table, and assign the appropriate flags:

set.seed(123)
dfa <- data.frame(id=c(1,2,3,5),val1=rnorm(4))
dfa
#>   id        val1
#> 1  1 -0.56047565
#> 2  2 -0.23017749
#> 3  3  1.55870831
#> 4  5  0.07050839
dfb <- data.frame(id=c(1,2,4,5),val2=rnorm(4))
dfb
#>   id       val2
#> 1  1  0.1292877
#> 2  2  1.7150650
#> 3  4  0.4609162
#> 4  5 -1.2650612
dfc <- merge(dfa,dfb,all=T)

dfc$flag <- 3 # Assume all rows are matched by default
dfc$flag[is.na(dfc$val1)] <- 1 # Search for NAs in val1 for no-match due to dfA
dfc$flag[is.na(dfc$val2)] <- 2 # Search for NAs in val2 for no-match due to dfB
dfc
#>   id        val1       val2 flag
#> 1  1 -0.56047565  0.1292877    3
#> 2  2 -0.23017749  1.7150650    3
#> 3  3  1.55870831         NA    2
#> 4  4          NA  0.4609162    1
#> 5  5  0.07050839 -1.2650612    3

^{Created on 2020-01-13 by the reprex package (v0.3.0)}

@Georgery, I've mentioned this in my answer. If no such columns exist, then making a dummy column for each table is easy enough. — Rohit, Jan 13 '20 at 10:34
True. Sorry. I guess I simply didn't read thoroughly. I will delete the comment. — Georgery, Jan 13 '20 at 10:35

score 0 · Answer 2 · answered Jan 13 '20 at 10:29

The tidyverse can help here.

anti_join()

"return all rows from x where there are not matching values in y, keeping just columns from x."

Hence,

dataset1 %>%
    anti_join(dataset2, by = c("joinColumn")

should give you what's in 1 but not joined by 2, and

dataset2 %>%
    anti_join(dataset1, by = c("joinColumn")

should give you what's in 2 but not joined by 1.

score 0 · Answer 3 · answered Jan 14 '20 at 05:12

Probably inefficient, but worked well.

dataset1$ind_data1 <-1 
dataset2$ind_data2 <-1

merged_data_all <- merge(dataset1, dataset2, 
by=c("key1","key2","key3", "Year"), all=T)
View(merged_data_all)


merged_data_all$ind_data1[is.na(merged_data_all$ind_data1)] <- 0
merged_data_all$ind_data2[is.na(merged_data_all$ind_data2)] <- 0

merge <- vector()
for(i in 1:length(merged_data_all$ind_data1)) {
if(merged_data_all$ind_data1[i]==1 & merged_data_all$ind_data2[i]==1) {
merge[i] = "match"
} else if(merged_data_all$ind_data1[i]==0 & merged_data_all$ind_data2[i]==1){
merge[i]="dataset 2 only"
} else{
merge[i]="dataset 1 only"
}
}

merged_data_all<- cbind(merged_data_all, merge)

Merge two data sets with indicator variable on Non-merges

3 Answers3