I have some data which looks like:
> df1
match1 match2 someData
1 AAPL Apple 100
2 GOOG Google Inc 105
3 MSFT Microsoft Inc 103
4 <NA> Nvidia 99
5 <NA> Amazon 101
6 INTC Intel 98
> df2
toMatch1 toMatch2 someOtherData
1 AAPL Apple 1000
2 INTC Intel 500
3 MSFT Microsoft 750
4 GOOG Google 1250
5 MMM 3M 1500
6 ATVI Activision 1000
7 AMZN Amazon 1750
8 NVDA Nvidia 2000
9 TSLA Tesla 500
10 FB Facebook 250
I want to join them together using an if
statement. In the df1
data, under the column match1
there are NA
values for two observations. I would like to create the if
statement such that:
if column exists in
match1
joinmatch1
withtoMatch1
using an "exact match"left_join
else join using
match2
with columntoMatch2
using a "fuzzy match"stringdist_join
I can join using:
left_join(df1, df2, by = c("match1" = "toMatch1"))
stringdist_join(df1, df2, by = c("match2" = "toMatch2"))
However when I use the second approach I obtain:
> stringdist_join(df1, df2, by = c("match2" = "toMatch2"))
# A tibble: 4 x 6
match1 match2 someData toMatch1 toMatch2 someOtherData
<fct> <fct> <dbl> <fct> <fct> <dbl>
1 AAPL Apple 100 AAPL Apple 1000
2 NA Nvidia 99 NVDA Nvidia 2000
3 NA Amazon 101 AMZN Amazon 1750
4 INTC Intel 98 INTC Intel 500
But Google Inc
in df1
should have been joined up with Google
in df2
and it hasn't. The same for Microsoft Inc
and Microsoft
. It seems that the stringdist_join
function is only looking for "exact matches" also...
My question is, I want to create an if
statement to first look at the match1
and toMatch1
columns and if there are values in these columns join the data on an exact match. Otherwise look at the match2
and toMatch2
column and try to join the data using fuzzyjoin
.
Data:
library(fuzzyjoin)
library(dplyr)
df1 <- data.frame(
match1 = c("AAPL", "GOOG", "MSFT", NA, NA, "INTC"),
match2 = c("Apple", "Google Inc", "Microsoft Inc", "Nvidia", "Amazon", "Intel"),
someData = c(100, 105, 103, 99, 101, 98)
)
df2 <- data.frame(
toMatch1 = c("AAPL", "INTC", "MSFT", "GOOG", "MMM", "ATVI", "AMZN", "NVDA", "TSLA", "FB"),
toMatch2 = c("Apple", "Intel", "Microsoft", "Google", "3M", "Activision", "Amazon", "Nvidia", "Tesla", "Facebook"),
someOtherData = c(1000, 500, 750, 1250, 1500, 1000, 1750, 2000, 500, 250)
)