I'm trying to do fuzzy matching using PySpark or Python, where I have two lists.

i. List of standard city names

Clarksburg 
Fremont 
San Leandro 
Albuquerque 
Columbus 
San Jose 
Martinez 
New York
Alhambra
Unknown
Las Vegas
Dublin
Niagara Falls

ii. List of misspelled city names

Clarksburg 
Closed 10/97
Fre,Nont
Fremong
San L:Eandro
Albuquerue
Clmbs
Sanjse
Martinz
New Yrk
Alambra
00011
L Vegas
Vegas
Ssan jose
Nw Yrk
Colmbus
Klarkburg
Alburque
Dublin
Niegara F

Now I want to match the misspelled city names against the list of standard values and create another list with the appropriate matches. I'm looking for the output below:

Clarksburg - Clarksburg
Closed 10/97 - Unknown
Fre,Nont - Fremont
Fremong - Fremont
San L:Eandro - San Leandro
Albuquerue - Albuquerque
Clmbs - Columbus
Sanjse - San Jose
Martinz - Martinez
New Yrk - New York
Alambra - Alhambra
00011 - Unknown
L Vegas - Las Vegas
Vegas - Las Vegas
Ssan jose - San Jose
Nw Yrk - New York
Colmbus - Columbus
Klarkburg - Clarksburg
Alburque - Albuquerque
Dublin - Dublin
Niegara F - Niagara Falls

Any help would be much appreciated. Thanks in advance.

Panwen Wang

1 Answer

Use fuzzywuzzy, and change threshold to meet your requirements:

from fuzzywuzzy import process

threshold = 40  # minimum score (0-100) needed to accept a match

matchlist = [x for x in """
Clarksburg
Fremont
San Leandro
Albuquerque
Columbus
San Jose
Martinez
New York
Alhambra
Unknown
Las Vegas
Dublin
Niagara Falls
""".splitlines() if x]

checklist = [x for x in """
Clarksburg
Closed 10/97
Fre,Nont
Fremong
San L:Eandro
Albuquerue
Clmbs
Sanjse
Martinz
New Yrk
Alambra
00011
L Vegas
Vegas
Ssan jose
Nw Yrk
Colmbus
Klarkburg
Alburque
Dublin
Niegara F
""".splitlines() if x]

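# For each misspelled name, take the best-scoring candidate from matchlist.
# extractOne returns a (choice, score) tuple; low scores are reported as 'Unknown'.
for check in checklist:
    match = process.extractOne(check, matchlist)
    print(f"{check} - {match[0] if match[1] > threshold else 'Unknown'}")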

This gives me:

Clarksburg - Clarksburg
Closed 10/97 - Unknown
Fre,Nont - Fremont
Fremong - Fremont
San L:Eandro - San Leandro
Albuquerue - Albuquerque
Clmbs - Columbus
Sanjse - San Jose
Martinz - Martinez
New Yrk - New York
Alambra - Alhambra
00011 - Unknown
L Vegas - Las Vegas
Vegas - Las Vegas
Ssan jose - San Jose
Nw Yrk - New York
Colmbus - Columbus
Klarkburg - Clarksburg
Alburque - Albuquerque
Dublin - Dublin
Niegara F - Niagara Falls
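
When tuning the threshold, it can help to see the score next to each result. The sketch below is only an illustration of that idea. It swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy's scorer so that it runs with no extra packages, so its scores are not on the same scale as fuzzywuzzy's, and the cutoff of 60 is an arbitrary assumption. `score` and `best_match` are hypothetical helper names, and the city list is shortened.

```python
from difflib import SequenceMatcher

def score(a, b):
    """Similarity on a 0-100 scale (difflib ratio, case-insensitive)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_match(query, choices, threshold):
    """Return (choice, score) for the best candidate, or ('Unknown', score)
    when the best score falls below the threshold."""
    best = max(choices, key=lambda c: score(query, c))
    s = score(query, best)
    return (best if s >= threshold else "Unknown", s)

cities = ["Clarksburg", "Fremont", "Columbus", "Dublin", "Las Vegas"]

for name in ["Fremong", "Dublin", "00011"]:
    match, s = best_match(name, cities, threshold=60)
    print(f"{name} - {match} ({s})")
```

Printing the scores for a sample of real data, then picking a cutoff between the worst acceptable match and the best unacceptable one, is one way to choose the threshold.
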
Mike Organek
  • How do I set the threshold, and why did we set it to 40 in our case? – Deepak Sanagapalli Jul 07 '20 at 21:20
  • @DeepakSanagapalli The value is between 0 and 100 based on the confidence of the match. I set it to 40 because `Closed 10/97` matches `Columbus` with a 34 confidence :-) – Mike Organek Jul 07 '20 at 21:23
  • Thanks @MikeOrganek, hope it will work for my requirement. Can you also help me in showing the match score beside each result? – Deepak Sanagapalli Jul 07 '20 at 21:50
  • @DeepakSanagapalli Use this in lieu of the existing `print()`: `print(f"{check} - {match[0] if match[1] > 0 else 'Unknown'} {match[1]}")` – Mike Organek Jul 07 '20 at 22:12
  • Instead of lists, can we also use CSV or TXT files for matchlist and checklist? If so, what would change? – Deepak Sanagapalli Jul 07 '20 at 23:59
  • @DeepakSanagapalli You can populate the lists from files. Please see https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list – Mike Organek Jul 08 '20 at 12:43
  • Hi @Mike Organek, I have another question: I want to map a value to MULTICOLOR if the entry in the check list contains multiple colors. Examples: WHITE & Brown - MULTICOLOR; BROWN BLACK - MULTICOLOR; TAN/BLACK/RED - MULTICOLOR; SLVR& BLK - MULTICOLOR – Deepak Sanagapalli Jul 27 '20 at 20:31
  • @DeepakSanagapalli I am sorry, but I do not think you are asking the right person. This question was about fuzzy matching city names. – Mike Organek Jul 27 '20 at 20:38
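
On the earlier question about populating `matchlist` and `checklist` from files, a minimal sketch for plain text files with one name per line might look like the following. `read_names` is a hypothetical helper and `cities.txt` is a made-up file name; the demo writes a throwaway file so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

def read_names(path):
    """Read one name per line, dropping blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]

# Demo with a temporary file; in practice, pass the path to your own file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cities.txt"  # hypothetical file name
    path.write_text("Clarksburg\nFremont\n\nSan Leandro \n", encoding="utf-8")
    matchlist = read_names(path)

print(matchlist)
```

The rest of the matching loop stays the same once both lists are loaded this way. For a CSV file, the standard library's `csv` module can be used to pull the relevant column into a list instead.
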