Check if data.frame is a subset of another data.frame

Question

Let's assume I have the following lookup table:

(lkp <- structure(list(a = c("a", "a", "a", "b", "c"),
                       b = c("a1 a2", "a3 a2", "a3", "a1", "a1")), 
                       row.names = c("lkp_1", "lkp_2", "lkp_3", "lkp_4", "lkp_5"), 
                       class = "data.frame"))
#       a     b
# lkp_1 a a1 a2
# lkp_2 a a3 a2
# lkp_3 a    a3
# lkp_4 b    a1
# lkp_5 c    a1

I want to check whether another data.frame, say x, is a subset of lkp, with one important additional requirement: for column b, matching means that lkp$b only needs to contain x$b.

The following example should make clear what I mean:

(chk <- list(c1 = structure(list(a = c("a", "a"), b = c("a2", "a2")), row.names = c(NA, -2L), class = "data.frame"), 
             c2 = structure(list(a = "b", b = "a1"), row.names = c(NA, -1L), class = "data.frame"), 
             c3 = structure(list(a = c("a", "a"), b = c("a1", "a1")), row.names = c(NA, -2L), class = "data.frame"), 
             c4 = structure(list(a = c("a", "a"), b = c("a3", "a2")), row.names = c(NA, -2L), class = "data.frame")))

# $c1
#   a  b
# 1 a a2
# 2 a a2

# $c2
#   a  b
# 1 b a1

# $c3
#   a  b
# 1 a a1
# 2 a a1

# $c4
#   a  b
# 1 a a3
# 2 a a2

chk$c1: row 1 matches row lkp_1 (and lkp_2), as column a is the same and lkp$b contains a2; row 2 matches the same two rows, so each row of c1 can be paired with its own row of lkp.
chk$c2 and chk$c4 match as well.
chk$c3 does NOT match. While each row matches lkp_1, c3 is not a subset, as lkp would need to contain 2 different rows which match.

In principle I am looking for a merge (or join) where the join condition would use some sort of fuzzy matching.

I have found and read these two SO answers:

And especially the second answer looks promising. However, I do not need approximate matching but rather some sort of does_contain relationship instead of pure equality. So maybe a regex solution would work?
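To make the "does contain" relationship concrete, here is a minimal sketch of a per-row predicate in base R. The helper names `row_matches` and `row_matches_token` are made up for illustration, and the token-based variant assumes that the entries of lkp$b are separated by single spaces, which the question does not state explicitly.

```r
lkp <- data.frame(
  a = c("a", "a", "a", "b", "c"),
  b = c("a1 a2", "a3 a2", "a3", "a1", "a1"),
  row.names = paste0("lkp_", 1:5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: which rows of lkp match a single (a, b) pair?
# Column a must be equal; lkp$b must contain b as a plain substring
# (fixed = TRUE switches off regex interpretation of b).
row_matches <- function(a, b, lkp) {
  lkp$a == a & grepl(b, lkp$b, fixed = TRUE)
}

rownames(lkp)[row_matches("a", "a2", lkp)]
# [1] "lkp_1" "lkp_2"

# Stricter variant (assumption: tokens are separated by single spaces),
# so that "a1" would not match a hypothetical entry such as "a10".
row_matches_token <- function(a, b, lkp) {
  tokens <- strsplit(lkp$b, " ", fixed = TRUE)
  lkp$a == a & vapply(tokens, function(tk) b %in% tk, logical(1))
}

rownames(lkp)[row_matches_token("a", "a3", lkp)]
# [1] "lkp_2" "lkp_3"
```

Either predicate only answers which rows of lkp a single row of x could use; it does not yet enforce that different rows of x end up with different rows of lkp.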

Expected Outcome

magic_is_subset_function <- function(chk, lkp) {
   # ...
}
sapply(chk, magic_is_subset_function, lkp = lkp)
# [1] TRUE TRUE FALSE TRUE

Thanks for the comment, I added the expected outcome. – thothal Jul 17 '21 at 21:57

ThomasIsCoding · Accepted Answer · 2021-07-18T21:15:35.967

sapply(
    chk,
    function(v) {
        sum(
            rowSums(sapply(v$a, `==`, lkp$a) &
                sapply(v$b, grepl, x = lkp$b)) > 0
        ) >= nrow(v)
    }
)

or

sapply(
    chk,
    function(v) {
        sum(
            colSums(
                do.call(
                    `&`,
                    Map(
                        function(x, y) outer(x, y, FUN = Vectorize(function(a, b) grepl(a, b))),
                        v,
                        lkp
                    )
                )
            ) > 0
        ) >= nrow(v)
    }
)

which gives

   c1    c2    c3    c4 
 TRUE  TRUE FALSE  TRUE

edited Jul 18 '21 at 21:15

answered Jul 17 '21 at 22:04

ThomasIsCoding


Nice idea, but it does not work if `lkp` contains more matching rows (think of `lkp <- rbind(lkp, lkp[4, ])`). Any ideas? – thothal Jul 18 '21 at 12:55

What's the expected output with your new `lkp`? – ThomasIsCoding Jul 18 '21 at 18:57
Same as before: `TRUE TRUE FALSE TRUE`. BTW: `c4` evaluates to `FALSE` in your case. It should return `TRUE`. – thothal Jul 18 '21 at 20:49
@thothal Why should it be `TRUE`? – ThomasIsCoding Jul 18 '21 at 20:52
`a a3` matches `lkp_3` and `a a2` matches both `lkp_1` and `lkp_2` (one match would be already sufficient). Rule is: first column must match as is, second column must be contained. – thothal Jul 18 '21 at 21:05
@thothal Okay, I see. You can use `>=nrow(v)`, rather than `==`. – ThomasIsCoding Jul 18 '21 at 21:15
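
The rule discussed in these comments (every row of the checked data.frame needs its own, distinct matching row in `lkp`) can also be read as a bipartite matching problem. Counting how many `lkp` rows are matched is a necessary condition, but it may not be sufficient in every case: for rows (a, a1), (a, a1), (a, a3), three `lkp` rows are matched overall, yet only lkp_1 contains a1. Below is a minimal base-R sketch using augmenting paths. It is not the answer author's method; `is_subset` and `try_assign` are hypothetical names, and plain substring containment via `grepl(fixed = TRUE)` is an assumption about what "contains" should mean.

```r
lkp <- data.frame(
  a = c("a", "a", "a", "b", "c"),
  b = c("a1 a2", "a3 a2", "a3", "a1", "a1"),
  stringsAsFactors = FALSE
)
chk <- list(
  c1 = data.frame(a = c("a", "a"), b = c("a2", "a2"), stringsAsFactors = FALSE),
  c2 = data.frame(a = "b", b = "a1", stringsAsFactors = FALSE),
  c3 = data.frame(a = c("a", "a"), b = c("a1", "a1"), stringsAsFactors = FALSE),
  c4 = data.frame(a = c("a", "a"), b = c("a3", "a2"), stringsAsFactors = FALSE)
)

# Hypothetical helper: TRUE if every row of x can be assigned its own,
# distinct row of lkp (a equal, lkp$b containing x$b as a substring).
is_subset <- function(x, lkp) {
  # adj[[i]]: indices of the lkp rows that row i of x is allowed to use
  adj <- lapply(seq_len(nrow(x)), function(i) {
    which(lkp$a == x$a[i] & grepl(x$b[i], lkp$b, fixed = TRUE))
  })
  owner <- rep(NA_integer_, nrow(lkp))  # owner[j]: x row currently using lkp row j
  seen <- logical(nrow(lkp))            # lkp rows visited in the current search

  # Try to place x row i; if a candidate lkp row is taken, try to move
  # its current owner to another lkp row (augmenting path).
  try_assign <- function(i) {
    for (j in adj[[i]]) {
      if (seen[j]) next
      seen[j] <<- TRUE
      if (is.na(owner[j]) || try_assign(owner[j])) {
        owner[j] <<- i
        return(TRUE)
      }
    }
    FALSE
  }

  for (i in seq_len(nrow(x))) {
    seen <- logical(nrow(lkp))  # reset the visited marks for each x row
    if (!try_assign(i)) return(FALSE)
  }
  TRUE
}

sapply(chk, is_subset, lkp = lkp)
#    c1    c2    c3    c4 
#  TRUE  TRUE FALSE  TRUE 
```

With the extended table `rbind(lkp, lkp[4, ])` from the comments, the same call returns the same four values, since the extra row only adds another candidate for (b, a1).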
