
How can I identify all the variations of a word in one column (column_one) and then fill a value in another column (column_two) whenever a variation of that word is found?

E.g. fill the value column with P whenever a variation of "PHILADELPHIA" is found, and with I whenever a variation of "ILLINOIS" is found.

place value
PHIADELPHIA
PHIALDELPHIA
PHIDELPHIA
illinois
PHIELADELPHIA
PHIILADELPHIA
illinoi
PHILA
PHILA.
PHILAD
PHILADALPHIA
PHILADELPHIA
PHILADELAPHIA
PHILADELHIA
PHILADELHPIA
PHILADELLPHIA
PHILADELPHIA
PHILADELPH
PHILADELPHA
PHILADELPHAI
PHILADELPHI
PHILADELPHIA

Fuzzy Matching, Levenshtein distance, etc
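
For reference, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal pure-Python sketch follows; the function name `levenshtein` is made up for illustration and is not taken from any library:

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]
```

A small distance relative to the word length suggests a misspelling of the same word, which is the idea behind the fuzzy-matching libraries.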

Input String:

import pandas as pd
import numpy as np

place = ['PHIADELPHIA','PHIALDELPHIA','PHIDELPHIA','illinois','PHIELADELPHIA','PHIILADELPHIA','illinoi','PHILA','PHILA.','PHILAD','PHILADALPHIA','PHILADELPHIA','PHILADELAPHIA','PHILADELHIA','PHILADELHPIA','PHILADELLPHIA','PHILADELPHIA','PHILADELPH','PHILADELPHA','PHILADELPHAI','PHILADELPHI','PHILADELPHIA']
value=[np.nan]*len(place)
df = pd.DataFrame(zip(place,value), columns=["place", "value"])
df
  • I see you have fuzzywuzzy in your tags. Have you tried it? – Captain Caveman May 17 '23 at 21:21
  • I have checked `fuzzywuzzy`; however, I need help filling the values in the `value` column whenever a variation of word1 or word2 is encountered. How to implement that logic is the main concern @CaptainCaveman – fast_crawler May 17 '23 at 21:21
  • Does something like this help? `df.loc[df["place"].isin(["PHIADELPHIA", "PHILA"]), "value"] = "Philadelphia"`. The list should have all possibilities you found for Philadelphia. Also, you can refer to [here](https://stackoverflow.com/questions/60987641/check-if-there-is-a-similar-string-in-the-same-column) – Paulo Marques May 17 '23 at 22:02

1 Answer


A solution using fuzzywuzzy

from fuzzywuzzy import fuzz

threshold = 50  # minimum similarity score (0-100) to count as a match
df['value'] = df['place'].apply(
    lambda x: 'P' if fuzz.token_set_ratio(x, 'Philadelphia') >= threshold
    else 'I' if fuzz.token_set_ratio(x, 'ILLINOIS') >= threshold
    else None
)
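
If more target words are added later, one option is to score each place against every target and keep the best match, rather than relying on the order of the checks. The sketch below is only an illustration of that idea: it swaps in the standard library's `difflib.SequenceMatcher` for fuzzywuzzy, so its scores run from 0 to 1 and will not equal `token_set_ratio` values. The names `TARGETS` and `best_label` and the 0.5 threshold are assumptions, not part of the question.

```python
from difflib import SequenceMatcher

# Hypothetical mapping: canonical spelling -> label to write into `value`.
TARGETS = {"PHILADELPHIA": "P", "ILLINOIS": "I"}


def best_label(place, targets=TARGETS, threshold=0.5):
    """Return the label of the closest target, or None if nothing is close enough."""
    # Normalize case and drop surrounding spaces and dots ("PHILA." -> "PHILA").
    normalized = place.strip(" .").upper()
    # Score the place against every canonical spelling.
    scored = [
        (SequenceMatcher(None, normalized, target).ratio(), target)
        for target in targets
    ]
    best_score, best_target = max(scored)
    return targets[best_target] if best_score >= threshold else None
```

With the DataFrame from the question this could be applied as `df['value'] = df['place'].apply(best_label)`. The threshold probably needs tuning on real data, since short abbreviations such as "PHILA" score much lower than one-letter misspellings.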
PARAK