how to identify failure cases while merging two files?

Question

I have merged two dataframes in R using the following command.

tmp <- merge(df_a, df_b, by.x = c('colA'), by.y = c('colB'))

To my surprise more than 50% of the rows failed to merge which is not expected. I would like to identify the cases(from df_a, df_b) which failed to merge based on the given condition. Is there any way possible in R or we have to do command-line operations only ?

Edit: This can be achieved using one of the solution in this question and many other information can be obtained using the join functions!

If there is a column in `df_b` that does not have missing values, you can set the `all.x=TRUE` in the `merge` call, and then check for `NA`s in the `df_b` column of the result. You might also find the `anti_join` and `semi_join` functions in the **dplyr** package useful. — Matthew Plourde, Dec 02 '15 at 15:06
Yeah, the `all = TRUE` argument in `merge` is what you need here. It will return all unmatched values as `NA`, which you can then search for with `is.na`. Take a look at `?merge` for more details. — giraffehere, Dec 02 '15 at 15:11
This question as written has nothing to do with awk or sed. If it did you'd need to provide sample input and expected output and a completely different description for anyone to stand a chance of helping you use those tools to do whatever it is you are trying to do. So I'm deleting those tags. — Ed Morton, Dec 02 '15 at 15:35

score 2 · Answer 1 · answered Dec 02 '15 at 15:16

2

The dplyr package has the anti_join function designed for this task:

library(dplyr)

# two sample DFs, each one with an unmatched row
df_a <- data_frame(row=1:3, colA=c('a', 'b', 'c'))
df_b <- data_frame(row=1:3, colA=c('a', 'b', 'd'))

anti_join(df_a, df_b)
#     row  colA
#   (int) (chr)
# 1     3     c

anti_join(df_b, df_a)
#     row  colA
#   (int) (chr)
# 1     3     d

answered Dec 02 '15 at 15:16

Matthew Plourde

43,932
7
96
113

Thanks @Matthew for the `dplyr` ideas. – Prradep Dec 02 '15 at 16:44

score 1 · Accepted Answer · answered Dec 02 '15 at 15:12

You can use all=TRUE

df_a <- data.frame(colA=2*(1:10),datA=1:10)
df_b <- data.frame(colB=3*(1:10),datB=1:10)

tmp <- merge(df_a, df_b, by.x = c('colA'), by.y = c('colB'))
#   colA datA datB
# 1    6    3    2
# 2   12    6    4
# 3   18    9    6

tmp1 <- merge(df_a, df_b, by.x = c('colA'), by.y = c('colB'),all=TRUE)
#    colA datA datB
# 1     2    1   NA
# 2     3   NA    1
# 3     4    2   NA
# 4     6    3    2
# 5     8    4   NA
# 6     9   NA    3
# 7    10    5   NA
# 8    12    6    4
# 9    14    7   NA
# 10   15   NA    5
# 11   16    8   NA
# 12   18    9    6
# 13   20   10   NA
# 14   21   NA    7
# 15   24   NA    8
# 16   27   NA    9
# 17   30   NA   10

Or you can just check directly using %in%:

df_a[!df_a$colA %in% df_b$colB,]
#    colA datA
# 1     2    1
# 2     4    2
# 4     8    4
# 5    10    5
# 7    14    7
# 8    16    8
# 10   20   10

df_b[!df_b$colB %in% df_a$colA,]
#    colB datB
# 1     3    1
# 3     9    3
# 5    15    5
# 7    21    7
# 8    24    8
# 9    27    9
# 10   30   10

Thanks for the advice. Shame on me, completely forgot the `%in%` operator. — Prradep, Dec 02 '15 at 16:43

how to identify failure cases while merging two files?

2 Answers2