2

I have two datasets that I want to join on book column. However, the book names are a bit different from the first dataset. It seems there is no clear pattern, but the only condition would be this: the book names in df1 is a subset of the book names df2. Would you have any suggestion to join them?

df1 <- tribble(
  ~book, ~sales, 
  "a",   100,
  "bakora",    60,
  "c",  40,
)
df2 <- tribble(
  ~book, ~sales, 
  "a.com",   100,
  "bakorade",    60,
  "c.zi",    60,
)
datazang
  • 989
  • 1
  • 7
  • 20
  • 1
    Depends on how you want to match. In the first data, you have one letter 'a', that should match to 'a.com' e.g. `stringdist_inner_join(df1, df2, by = 'book')` returns the second row based on distance methods – akrun Mar 01 '21 at 22:30
  • Do you think there would be a way to build a condition like if the book name in df1 is in df2, join them? @akrun – datazang Mar 01 '21 at 22:32
  • 1
    In your example, all 3 in book are some way matching i.e. a with a.com, bakora with bakorade and c with c.zi – akrun Mar 01 '21 at 22:33
  • Thanks for the fuzzyjoin library, but it gave the zero overlap. @akrun – datazang Mar 01 '21 at 22:35
  • 1
    For me the `stringdist_inner_join` is returning the second row, but you can always make some changes with `max_dist` – akrun Mar 01 '21 at 22:36
  • This is great! Thanks a lot @akrun – datazang Mar 01 '21 at 22:39

1 Answers1

2

We can use stringdist

library(fuzzyjoin)
stringdist_inner_join(df1, df2, by = 'book')
akrun
  • 874,273
  • 37
  • 540
  • 662