2

I am trying to figure out how to use a statement similar to %LIKE% in the R programming language. Using the following stackoverflow post: How to join (merge) data frames (inner, outer, left, right), I was able to figure out how to run basic merges in SQL. But for some reason, this does not work with the %LIKE% condition.

For instance, if I create the following data:

table_a <- data.frame (

"name" = c("John", "ALex", "ToM", "Kev", "Peter"),
"color" = c("red", "blue", "green", "yellow", "pink")

)

name$table_a = as.factor(name$table_a)
    
table_b <- data.frame (

"name" = c("Johnathan", "Alexander", "Tomas", "Kevin", "Luke", "Ryan"),
"food" = c("pizza", tacos", "sushi", "cake", "brownies", "burgers")

)

name$table_b = as.factor(name$table_b)

table_c <- data.frame (

"name" = c("Johnatha", "Alexande1", "Toma1", "Kevi1", "Luk1"),
"food" = c("pizza", tacos", "sushi", "cake", "brownies")

)

name$table_c = as.factor(name$table_c)

Now what I want to do, is run a "left join" if the name in table_a is contained somewhere within the name in table_b. (using the same logic, it should also be possible to use the one-sided %LIKE as well?)

#Left joins

join_1 =  merge(x = table_a, y = table_b, by = "%name%", all.x = TRUE)

join_2 =  merge(x = table_b, y = table_c, by = "%name", all.x = TRUE)

In regular SQL statements, it is generally straightforward to select rows of data if they meet a condition specified by %LIKE%. Is the same thing possible in R?

# select using %LIKE% (is there a way to override "case sensitivity" ? e.g. %like% "jOn"?)

selected_1 = table_a[name %like% "Jon"|| "Ale" || "Pet"]
selected_2 = table_a[name %like% "Jon"|| "Ale" || "Pet" || color %like% "ye"]

Thanks

  • 2
    It seems like you have two questions here, one about merging data and one about selecting data. For the latter, "data.table" has `%like%` but to access case sensitivity, you have to use the `like()` version. For example, you can do `library(data.table); as.data.table(table_b)[like(name, "joh|al|tom", ignore.case = TRUE) & like(food, "pizza")]`. – A5C1D2H2I1M1N2O1R2T1 Dec 16 '20 at 23:50
  • is there a way to do this for a full match? e..g NEW = (table_b) [ (name = "john|tom|alex") $ (food = "pizza)] – stats_noob Dec 17 '20 at 01:17
  • 1
    Can you please spend some time making sure the code you give us actually works? Examples: `name$table_a` should be `table_a$name`; and `tacos"` should be `"tacos"`. (I believe.) It can be distracting parsing through typos like that in a question, and wondering what of your problems are due to other typos like that in your real (not-visible) code and how much is related to something else. – r2evans Dec 17 '20 at 01:49

2 Answers2

3

I think you will probably need to use other R functions to achieve something which is not done natively with merge. grepl is the main function that checks if one string is found in another. If you want other patterns you can use startsWith (LIKE%) or endsWith (%LIKE), instead of grepl.

#sapply iterates over the names of table_a
table_a$name_b <- sapply(table_a$name, function(x)  {
                          #check to see if the names of table_a 
                          #are included within table_b
                          #which(...)[1] selects the first instance
                          check <- which(grepl(x, table_b$name, ignore.case = TRUE))[1]
                          #filter table_b and return what matched.
                          table_b$name[check]
                        })

Output:

# table_a
#   name  color    name_b
#1  John    red Johnathan
#2  Alex   blue Alexander
#3   Tom  green     Tomas
#4   Kev yellow     Kevin
#5 Peter   pink     <NA>
                  
LyzandeR
  • 37,047
  • 12
  • 77
  • 87
  • Thank you for your reply! I guess there is no straight forward way to do this .... is there a way to add the "food" column from table_b to your answer? –  Dec 17 '20 at 01:14
  • Of course! Instead of `table_b$name[check]`, you just add column food, i.e. `table_b$food[check]. – LyzandeR Dec 17 '20 at 08:59
3

You can get fairly close with fuzzyjoin:

library(fuzzyjoin)
regex_right_join(table_b, table_a, by = "name", ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
#      name.x  food name.y  color
# 1 Johnathan pizza   John    red
# 2 Alexander tacos   ALex   blue
# 3     Tomas sushi    ToM  green
# 4     Kevin  cake    Kev yellow
# 5      <NA>  <NA>  Peter   pink

The fuzzyjoin package will rename the by= variables here, for the sake of unambiguity. While this might seem inconvenient, I personally find its lack of ambiguity very safe to use. Especially here, where the two will often not be the same (since one is, by definition of our intent, a "pattern" to test against the other).

BTW: [name %like% "Jon"|| "Ale" || "Pet"] is not going to do what you hope: it is first producing a logical vector based on name %like% "Jon", and then OR-ing that vector as one whole element against "Ale" (which is an error), etc. What you might be looking for is a regex-based %in%, which is not quite that. Further, || is single-element OR, | is a vectorized OR, I suspect you want the latter.

For a regex-like %in%, you'd want something like:

sapply(c("Jon", "Ale", "Pet"), grepl, x = table_a$name)
#        Jon   Ale   Pet
# [1,] FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE
# [3,] FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE
# [5,] FALSE FALSE  TRUE

table_a[rowSums(sapply(c("Jon", "Ale", "Pet"), grepl, x = table_a$name)) > 0,]
#    name color
# 5 Peter  pink
table_a[rowSums(sapply(c("Jon", "Ale", "Pet"), grepl, x = table_a$name)) > 0 |
          grepl("ye", table_a$color),]
#    name  color
# 4   Kev yellow
# 5 Peter   pink

Which can be turned into a small inline operator like this:

`%regexin%` <- function(lhs, rhs) rowSums(sapply(rhs, grepl, x = as.character(lhs))) > 0
`%iregexin%` <- function(lhs, rhs) rowSums(sapply(rhs, grepl, x = as.character(lhs), ignore.case = TRUE)) > 0

table_a$name %regexin% c("Jon", "Ale", "Pet")
# [1] FALSE FALSE FALSE FALSE  TRUE
subset(table_a, name %regexin% c("Jon", "Ale", "Pet"))
#    name color
# 5 Peter  pink

This doesn't help with your join, mind you, though it might be part of some related logic.


Data:

table_a <- structure(list(name = structure(c(2L, 1L, 5L, 3L, 4L), .Label = c("ALex", "John", "Kev", "Peter", "ToM"), class = "factor"), color = c("red", "blue", "green", "yellow", "pink")), row.names = c(NA, -5L), class = "data.frame")
table_b <- structure(list(name = structure(c(2L, 1L, 6L, 3L, 4L, 5L), .Label = c("Alexander", "Johnathan", "Kevin", "Luke", "Ryan", "Tomas"), class = "factor"), food = c("pizza", "tacos", "sushi", "cake", "brownies", "burgers")), row.names = c(NA, -6L), class = "data.frame")
table_c <- structure(list(name = structure(c(2L, 1L, 5L, 3L, 4L), .Label = c("Alexande1", "Johnatha", "Kevi1", "Luk1", "Toma1"), class = "factor"), food = c("pizza", "tacos", "sushi", "cake", "brownies")), row.names = c(NA, -5L ), class = "data.frame")
r2evans
  • 141,215
  • 6
  • 77
  • 149