6

I am trying to create a program in R to replace city names or airport names with the three digit airport code. I want to do fuzzy matching to allow more flexibility since the data with the city/airport names I am trying to replace is coming in from many different sources. My problem is that with fuzzy match left joining I can't seem to find a way to only bring back the closest match to the left table (names) from the right table (codes).

For example: The city Augusta, GA was matched with both Augusta, GA and Augusta, ME and duplicated the data. I don't want to limit the max distance because I still want to allow for flexibility, but I can't have my data getting duplicated. I want to find a way to do partial string matching but to only bring back the closest result.

I have tried using the fuzzyjoin package, but from what I can see there is no way to limit only one match or only the best match. I know that in pmatch there is a call to prohibit duplicates but I can't find a way to make pmatch work as a join.

data <- stringdist_left_join(data, orig, ignore_case = TRUE)

This is the code I am using, the stringdist is a function of the fuzzyjoin package in R. The dataset "data" has city names, number of flights and other passenger info. The "orig" dataset has a column of city/airport names and the airport code

SAMPLE INPUT
**data table:**
City Name       Passenger Name    Fare Paid
Augusta, GA         Jon            $100
Dallas, TX          Jane           $200
Spokane, WA         Chris          $300

**orig table:**
City Name      Code
Augusta, GA    JCL
Dallas, TX     DAL
Denver, CO     DEN
Seattle, WA    SEA
Spokane, WA    GEG
Austin, TX     AUS
Augusta, ME    PEA
Portland, ME   MEW
Portland, OR   PDX



Desired Result
City Name       Passenger Name    Fare Paid   Code
Augusta, GA         Jon            $100        JCL

sarahbarnes
  • 103
  • 2
  • 7
  • 1
    It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Sep 12 '19 at 15:13
  • @MrFlick added a sample of both tables and what I want the result to be – sarahbarnes Sep 12 '19 at 15:24
  • One could suggest multiple algorithms to give a unique match. One suggestion is to use the minimum levenstein distance. If ties, use the mode. If ties, use first alphabetically... and so on. It devolves into a thought experiment. Perhaps non-uniqueness of fuzzy match is itself a finding. Even Google asks for clarification when needed. – AdamO Sep 12 '19 at 15:44

1 Answers1

2

It's hard to tell without a reprex, but generally, you can process your fuzzyjoin results with

%>% group_by(column_you_joined_by) %>% slice_min(dist)

to get the one "best" result.

References:

https://github.com/dgrtwo/fuzzyjoin/issues/18

How can I match fuzzy match strings from two datasets?

Arthur Yip
  • 5,810
  • 2
  • 31
  • 50