0

I have some data which looks like:

> df1
  match1        match2 someData
1   AAPL         Apple      100
2   GOOG    Google Inc      105
3   MSFT Microsoft Inc      103
4   <NA>        Nvidia       99
5   <NA>        Amazon      101
6   INTC         Intel       98
> df2
   toMatch1   toMatch2 someOtherData
1      AAPL      Apple          1000
2      INTC      Intel           500
3      MSFT  Microsoft           750
4      GOOG     Google          1250
5       MMM         3M          1500
6      ATVI Activision          1000
7      AMZN     Amazon          1750
8      NVDA     Nvidia          2000
9      TSLA      Tesla           500
10       FB   Facebook           250

I want to join them together using an if statement. In the df1 data, under the column match1 there are NA values for two observations. I would like to create the if statement such that:

if column exists in match1 join match1 with toMatch1 using an "exact match" left_join

else join using match2 with column toMatch2 using a "fuzzy match" stringdist_join

I can join using:

left_join(df1, df2, by = c("match1" = "toMatch1"))
stringdist_join(df1, df2, by = c("match2" = "toMatch2"))

However when I use the second approach I obtain:

> stringdist_join(df1, df2, by = c("match2" = "toMatch2"))
# A tibble: 4 x 6
  match1 match2 someData toMatch1 toMatch2 someOtherData
  <fct>  <fct>     <dbl> <fct>    <fct>            <dbl>
1 AAPL   Apple       100 AAPL     Apple             1000
2 NA     Nvidia       99 NVDA     Nvidia            2000
3 NA     Amazon      101 AMZN     Amazon            1750
4 INTC   Intel        98 INTC     Intel              500

But Google Inc in df1 should have been joined up with Google in df2 and it hasn't. The same for Microsoft Inc and Microsoft. It seems that the stringdist_join function is only looking for "exact matches" also...

My question is, I want to create an if statement to first look at the match1 and toMatch1 columns and if there are values in these columns join the data on an exact match. Otherwise look at the match2 and toMatch2 column and try to join the data using fuzzyjoin.

Data:

library(fuzzyjoin)
library(dplyr)

df1 <- data.frame(
  match1 = c("AAPL", "GOOG", "MSFT", NA, NA, "INTC"),
  match2 = c("Apple", "Google Inc", "Microsoft Inc", "Nvidia", "Amazon", "Intel"),
  someData = c(100, 105, 103, 99, 101, 98)
)

df2 <- data.frame(
  toMatch1 = c("AAPL", "INTC", "MSFT", "GOOG", "MMM", "ATVI", "AMZN", "NVDA", "TSLA", "FB"),
  toMatch2 = c("Apple", "Intel", "Microsoft", "Google", "3M", "Activision", "Amazon", "Nvidia", "Tesla", "Facebook"),
  someOtherData = c(1000, 500, 750, 1250, 1500, 1000, 1750, 2000, 500, 250)
)
user113156
  • 6,761
  • 5
  • 35
  • 81

1 Answers1

2

You need to adjust the max_dist parameter in stringdist_join

library(fuzzyjoin)
library(dplyr)
ans1 <- inner_join(df1, df2, by = c("match1" = "toMatch1"))
ans2 <- stringdist_join(df1, df2, by = c("match2" = "toMatch2"), max_dist  = 4)
ans2

# A tibble: 10 x 6
#   match1 match2        someData toMatch1 toMatch2  someOtherData
#   <chr>  <chr>            <dbl> <chr>    <chr>             <dbl>
# 1 AAPL   Apple              100 AAPL     Apple              1000
# 2 AAPL   Apple              100 INTC     Intel               500
# 3 AAPL   Apple              100 GOOG     Google             1250
# 4 AAPL   Apple              100 TSLA     Tesla               500
# 5 GOOG   Google Inc         105 GOOG     Google             1250
# 6 MSFT   Microsoft Inc      103 MSFT     Microsoft           750
# 7 NA     Nvidia              99 NVDA     Nvidia             2000
# 8 NA     Amazon             101 AMZN     Amazon             1750
# 9 INTC   Intel               98 AAPL     Apple              1000
#10 INTC   Intel               98 INTC     Intel               500

To give preference to ans1 you can then use anti_join to remove match1 values in ans2 which are already present in ans1.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213