I am trying to create a program in R to replace city names or airport names with the three digit airport code. I want to do fuzzy matching to allow more flexibility since the data with the city/airport names I am trying to replace is coming in from many different sources. My problem is that with fuzzy match left joining I can't seem to find a way to only bring back the closest match to the left table (names) from the right table (codes).
For example: The city Augusta, GA was matched with both Augusta, GA and Augusta, ME and duplicated the data. I don't want to limit the max distance because I still want to allow for flexibility, but I can't have my data getting duplicated. I want to find a way to do partial string matching but to only bring back the closest result.
I have tried using the fuzzyjoin package, but from what I can see there is no way to limit only one match or only the best match. I know that in pmatch there is a call to prohibit duplicates but I can't find a way to make pmatch work as a join.
data <- stringdist_left_join(data, orig, ignore_case = TRUE)
This is the code I am using, the stringdist is a function of the fuzzyjoin package in R. The dataset "data" has city names, number of flights and other passenger info. The "orig" dataset has a column of city/airport names and the airport code
SAMPLE INPUT
**data table:**
City Name Passenger Name Fare Paid
Augusta, GA Jon $100
Dallas, TX Jane $200
Spokane, WA Chris $300
**orig table:**
City Name Code
Augusta, GA JCL
Dallas, TX DAL
Denver, CO DEN
Seattle, WA SEA
Spokane, WA GEG
Austin, TX AUS
Augusta, ME PEA
Portland, ME MEW
Portland, OR PDX
Desired Result
City Name Passenger Name Fare Paid Code
Augusta, GA Jon $100 JCL