1

Let's assume I have the following lookup table:

(lkp <- structure(list(a = c("a", "a", "a", "b", "c"),
                       b = c("a1 a2", "a3 a2", "a3", "a1", "a1")), 
                       row.names = c("lkp_1", "lkp_2", "lkp_3", "lkp_4", "lkp_5"), 
                       class = "data.frame"))
#       a     b
# lkp_1 a a1 a2
# lkp_2 a a3 a2
# lkp_3 a    a3
# lkp_4 b    a1
# lkp_5 c    a1 

I want to check if another data.frame, x, say, is a subset of lkp, with the important additional requirement, that for column b matching means that lkp$b need only to contain x$b.

The following example should make clear what I mean:

(chk <- list(c1 = structure(list(a = c("a", "a"), b = c("a2", "a2")), row.names = c(NA, -2L), class = "data.frame"), 
             c2 = structure(list(a = "b", b = "a1"), row.names = c(NA, -1L), class = "data.frame"), 
             c3 = structure(list(a = c("a", "a"), b = c("a1", "a1")), row.names = c(NA, -2L), class = "data.frame"), 
             c4 = structure(list(a = c("a", "a"), b = c("a3", "a2")), row.names = c(NA, -2L), class = "data.frame")))

# $c1
#   a  b
# 1 a a2
# 2 a a2

# $c2
#   a  b
# 1 b a1

# $c3
#   a  b
# 1 a a1
# 2 a a1

# $c4
#   a  b
# 1 a a3
# 2 a a2
  • chk$c1: row 1 matches row lkp_1 (and lkp_2) as column a is the same and lkp$b contains a2
  • chk$c2 and chk$c4 match as well
  • chk$c3 does NOT match. While each row matches lkp_1, c4 is not a subset as lkp would need to contain 2 different rows which match.

In principle I am looking for a merge (or join) where the join condition would use some sort of fuzzy matching.

I have found and read these two SO answers:

And especially the second answer looks promising. However, I do not need approximate matching but rather some sort of does_contain relationship instead of pure equality. So maybe a regex solution would work?

Expected Outcome

magic_is_subset_function <- function(chk, lkp) {
   # ...
}
sapply(chk, magic_is_subset_function, lkp = lkp)
# [1] TRUE TRUE FALSE TRUE
thothal
  • 16,690
  • 3
  • 36
  • 71

1 Answers1

2
sapply(
    chk,
    function(v) {
        sum(
            rowSums(sapply(v$a, `==`, lkp$a) &
                sapply(v$b, grepl, x = lkp$b)) > 0
        ) >= nrow(v)
    }
)

or

sapply(
    chk,
    function(v) {
        sum(
            colSums(
                do.call(
                    `&`,
                    Map(
                        function(x, y) outer(x, y, FUN = Vectorize(function(a, b) grepl(a, b))),
                        v,
                        lkp
                    )
                )
            ) > 0
        ) >= nrow(v)
    }
)

which gives

   c1    c2    c3    c4 
 TRUE  TRUE FALSE FALSE
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81