Compare similar spelling in Pandas dataframe column but different value in another column

Question

Let's say I have a Pandas dataframe in Python that looks something like this:

df_test = pd.DataFrame(data=None, columns=['file', 'number'])
df_test.file = ['washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345', 'seattle_976', 'seattle_977', 'boston_367', 'boston 098']
df_test.number = [20, 21, 33, 34, 33, 45, 45, 52, 52]

What I want to find out from this dataset are those strings in 'file' that start with the same exact letters (maybe 50% of the string at least), but that do not have the same corresponding value in the 'number' column. In this example, it means I would want to create a new dataframe that finds:

'washington_142', 'washington_287', 'chicago_453', 'chicago_221', 'chicago_345'

But none of the others since they have the same 'number' when the spelling starts with the same string. I know there is a function 'difflib.get_close_matches' but I am not sure how to implement it to check with the other column in the dataframe. Any advice or help is really appreciated!

score 2 · Accepted Answer · answered May 22 '23 at 13:20

You need to clarify your rule (how many letters? or how much fraction?)

Assuming you want a full match:

df['match'] = df['file'].str.extract('^([a-zA-Z]+)', expand=False)
df = df.groupby('match').filter( lambda _df : _df.number.nunique() > 1)
print(df['file'].unique())

score 1 · Answer 2 · answered May 22 '23 at 13:51

The answer by Learning is a mess is much more efficient if the letters in the file strings will fully match. If there are other differences in file other than numbers and _/ , then you might want to use fuzzywuzzy to match the similarity of the files:

from fuzzywuzzy import fuzz

# get all permutations
compare = pd.MultiIndex.from_product([df_test.file,
                                      df_test.file]).to_series()
# fuzzy match - see https://stackoverflow.com/a/54866372/18571565
def metrics(tup):
    return pd.Series([fuzz.ratio(*tup)],
                     ['ratio'])
compare = compare.apply(metrics)
compare = compare.loc[compare.ratio.ge(60)]  # chosen 60% minimum match here
# get list of non-matching numbers for "matched" files
non_matching_files = compare.loc[
    # convert 'compare' index to pd.DataFrame
    pd.DataFrame(compare.index.to_list()).replace(
        # replace all values in df with matching 'number'
        df_test.set_index("file")["number"].to_dict())\
        # calculat differences between two columns and find those not equal
        .diff(axis=1)[1].ne(0).to_list()]\
    # return the first column of the index (the 'grouped' column) as a list
    .index.get_level_values(0).to_list()
# filter df_test for 'file' in list
df_test = df_test[df_test.file.isin(non_matching_files)]

Compare similar spelling in Pandas dataframe column but different value in another column

2 Answers2