2

I am trying to left-join df2 onto df1.

df1 is my dataframe of interest, df2 contains additional information I need.

Example:

#df of interest onto which the other should be joined
key1 <- c("London", "Paris", "Berlin", "Delhi") 
other_stuff <- c("Tea", "Coffee", "Beer", "Tea") 
df1 <- data.frame(key1, other_stuff)

#additional info df
key2 <- c("London and other cities", "some other city", "Eastberlin is history", "Berlin", "Delia is a name", "Delhi is a place") 
more_info <- c("history", "languages", "trades", "art", "commerce", "manufacturing")
df2 <- data.frame(key2,more_info)

What I now want is that df2$key2 is searched for the precise occurrence of df1$key1 and then merged onto df1 (e.g. match Berlin to Berlin, but not to Eastberlin, and Delhi to Delhi but not to Delia) while ignoring the other words surrounding the match.

Desired outcome:

key1 other_stuff more_info
London Tea history
Paris Coffee NA
Berlin Beer art
Delhi Tea manufacturing

I tried variations of regex_left_join joined<- regex_left_join(df1,df2, by = c("key1" = "key2"), ignore_case= F)

and fuzzyjoins joined<- df1%>% fuzzy_left_join(df2, by = c("key1" = "key2"), match_fun = str_detect)

They both only give a result for the exact match (key1=key2=Berlin) and give NA for everything else.

How do I do this?

I also tried Merging two tables where one column is substring of the other in R but the logic in the SQL there is the wrong way around. I tried several other Stackexchange approaches, but they are "too fuzzy" for my data.

Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
Auream
  • 55
  • 4

3 Answers3

1

Here I use the "regular" dplyr::left_join, but performed some selection in df2 when joining it with df1.

First create a vector that contains your target city. Then I will split df2$key2 by white space, and see if there is any word that matches the strings in the vector city. Then left_join it with df1.

library(tidyverse)

city <- c("London", "Paris", "Berlin", "Delhi")

left_join(df1,
          df2 %>% mutate(city = sapply(strsplit(df2$key2, " "), 
                                       function(x) first(intersect(city, x)))),
          by = c("key1" = "city")) %>% 
  select(-key2)

    key1 other_stuff     more_info
1 London         Tea       history
2  Paris      Coffee          <NA>
3 Berlin        Beer           art
4  Delhi         Tea manufacturing
benson23
  • 16,369
  • 9
  • 19
  • 38
  • I tried your solution, however, it gives me the following error message: "Error in (function (cond) : error in evaluating the argument 'x' in selecting a method for function 'select': Problem with `mutate()` column `city`. i `city = sapply(...)`. x unable to find an inherited method for function ‘first’ for signature ‘"character"’ Caused by error: ! unable to find an inherited method for function ‘first’ for signature ‘"character"’ " – Auream Feb 16 '22 at 20:23
  • I'm not sure what's wrong, try importing all functions from `dplyr` (e.g. `dplyr::first`) – benson23 Feb 17 '22 at 06:05
0

The following works with the posted data examples but it uses two joins and is probably ineffective for larger data sets.

library(dplyr)
library(fuzzyjoin)

left_join(
  df1,
  regex_left_join(df2, df1, by = c(key2 = "key1"))[c(3, 4, 2)] |> na.omit()
)
#> Joining, by = c("key1", "other_stuff")
#>     key1 other_stuff     more_info
#> 1 London         Tea       history
#> 2  Paris      Coffee          <NA>
#> 3 Berlin        Beer           art
#> 4  Delhi         Tea manufacturing

Created on 2022-02-16 by the reprex package (v2.0.1)

Rui Barradas
  • 70,273
  • 8
  • 34
  • 66
0

You don't get the expected result because those functions passes the second dataframe as the regex pattern, so instead you could use regex_right_join or fuzzy_right_join:

df1 %>% 
  regex_right_join(df2, ., by = c(key2 = "key1")) %>% 
  select(key1, other_stuff, more_info)

df1 %>% 
  fuzzy_right_join(df2, ., by = c(key2 = "key1"), match_fun = str_detect) %>% 
  select(key1, other_stuff, more_info)

output

    key1 other_stuff     more_info
1 London         Tea       history
2  Paris      Coffee          <NA>
3 Berlin        Beer           art
4  Delhi         Tea manufacturing
Maël
  • 45,206
  • 3
  • 29
  • 67