1

Consider a dataframe :

company  |  label 

  comp1       fashion
  comp2       fashionitem
  comp3       fashionable
  comp4       auto
  comp5       autoindustry
  comp6       automobile
  comp6       food
  comp7       delivery

I want to clean-up the labels a bit, and I am using a string distance for that:

from difflib import SequenceMatcher

def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()

The question is, how can I write a function that applies the distance function between any two elements on the label column and, at the end, replaces all similar elements (distance above a certain threshold) with the shortest string?

The result should be something like:

company  |  label 

  comp1       fashion
  comp2       fashion
  comp3       fashion
  comp4       auto
  comp5       auto
  comp6       auto
  comp6       food
  comp7       delivery

I am thinking of performing 2 for loops, but my dataset may be quite large, is there an efficient way of doing this?

EDIT: While reading the below replies, I realize I made a mistake. The overall number of entries (number of companies) is large, BUT the overall number of unique labels is small, less than 1000. One could apply df.label(unique) and work with that, I guess.

Qubix
  • 4,161
  • 7
  • 36
  • 73
  • How large is your dataset? – Erfan Nov 20 '19 at 09:30
  • @Erfan right now it "only" has around 500k entries, but I think it could go to maybe 2-3 million entries in total. – Qubix Nov 20 '19 at 09:41
  • Are you willing to accept approximate solution, the exact solution would take 500_000 ^ 2 operations at least. Does the similar elements always share a common prefix? – Dani Mesejo Nov 20 '19 at 09:44
  • @DanielMesejo, I would love to have an exact function, that works for a small number of different labels as well ( < 1000). I made a correction in the question above, as I realize the number of different labels is small. – Qubix Nov 20 '19 at 09:47
  • In your above example fashion, fashionable, fashionitem form a cluster therefore all of them must be replaced by the one shortest of them? – Dani Mesejo Nov 20 '19 at 10:01
  • @DanielMesejo yep. – Qubix Nov 20 '19 at 10:06
  • [This fuzzy_merge](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas/56315491#56315491) function I wrote can help you. I came quite close, but couldn't get your exact expected output. Start of with a self join `fuzzy_merge(df, df, 'label', 'label', threshold=50, limit=5)` and proceed from there. Also play around with the `threshold` argument – Erfan Nov 20 '19 at 10:26

1 Answers1

1

Idea

The idea for this approach is to build an adjacency matrix from the threshold (ed) ratio matrix. From the adjacency matrix build a graph, and from it get the connected components (clusters). Getting to the desired output can be tricky, but it can be achieved with an (absolute) threshold of 0.49.

Setup

from difflib import SequenceMatcher

import networkx as nx
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['comp1', 'fashion'],
                        ['comp2', 'fashionitem'],
                        ['comp3', 'fashionable'],
                        ['comp4', 'auto'],
                        ['comp5', 'autoindustry'],
                        ['comp6', 'automobile'],
                        ['comp6', 'food'],
                        ['comp7', 'delivery']], columns=['company', 'label'])


def distance(a, b):
    return SequenceMatcher(None, a, b).ratio()

Code

# get unique labels
labels = df['label'].unique()

# compute ratios
result = np.array([[distance(li, lj) for lj in labels] for li in labels])

# set diagonal to zero
result[np.arange(8), np.arange(8)] = 0

# build adjacency matrix
adjacency_matrix = (result > 0.49).astype(int)

# create graph
dg = nx.from_numpy_array(adjacency_matrix, create_using=nx.Graph)

# create mapping dictionary from connected components
mapping = {}
for component in nx.connected_components(dg):
    group = labels[np.array(list(component))]
    value = min(group, key=len)
    mapping.update({label: value for label in group})

result = df.assign(label=df.label.map(mapping))

print(result)

Output

  company     label
0   comp1   fashion
1   comp2   fashion
2   comp3   fashion
3   comp4      auto
4   comp5      auto
5   comp6      auto
6   comp6      food
7   comp7  delivery
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76