Consider a dataframe:
company | label
comp1   | fashion
comp2   | fashionitem
comp3   | fashionable
comp4   | auto
comp5   | autoindustry
comp6   | automobile
comp6   | food
comp7   | delivery
I want to clean up the labels a bit, and I am using a string distance for that:
from difflib import SequenceMatcher

def distance(a, b):
    # ratio() returns a similarity in [0, 1]; higher means the strings are more alike
    return SequenceMatcher(None, a, b).ratio()
The question is: how can I write a function that applies the distance function to every pair of elements in the label column and, at the end, replaces all similar elements (distance above a certain threshold) with the shortest string among them?
The result should be something like:
company | label
comp1   | fashion
comp2   | fashion
comp3   | fashion
comp4   | auto
comp5   | auto
comp6   | auto
comp6   | food
comp7   | delivery
I am thinking of using two nested for loops, but my dataset may be quite large. Is there a more efficient way of doing this?
EDIT: While reading the replies below, I realized I made a mistake. The overall number of entries (number of companies) is large, but the number of unique labels is small: fewer than 1000. One could call df.label.unique() and work with that, I guess.
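To make the unique-labels idea concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes a greedy strategy: visit the distinct labels from shortest to longest, and map each one to the most similar label already kept as a representative if their ratio reaches the threshold. The names similarity and build_label_map are hypothetical, and the 0.45 threshold is only an illustrative value that happens to suit the sample data above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def build_label_map(labels, threshold=0.45):
    """Map each distinct label to the shortest sufficiently similar label.

    Greedy sketch: the threshold default is an assumption, tune it for real data.
    """
    canonical = []  # representatives kept so far, shortest first
    mapping = {}
    # Visit short labels first so that they become the representatives.
    for label in sorted(set(labels), key=lambda s: (len(s), s)):
        # Closest representative so far, or None if there is none yet.
        best = max(canonical, key=lambda c: similarity(c, label), default=None)
        if best is not None and similarity(best, label) >= threshold:
            mapping[label] = best
        else:
            canonical.append(label)
            mapping[label] = label
    return mapping


labels = ["fashion", "fashionitem", "fashionable", "auto",
          "autoindustry", "automobile", "food", "delivery"]
mapping = build_label_map(labels)
cleaned = [mapping[label] for label in labels]

# With pandas, the same mapping could be applied to the whole column:
# df["label"] = df["label"].map(build_label_map(df["label"].unique()))
```

Each label is compared only against the representatives kept so far rather than against every other label, so the cost depends on the number of unique labels and not on the number of rows. The trade-off is that the result depends on the greedy order, so it is not a full pairwise clustering.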