11

Suppose I have a list:-

person_name = ['zakesh', 'oldman LLC', 'bikash', 'goldman LLC', 'zikash','rakesh']

I am trying to group the list in such a way so the Levenshtein distance between two strings is maximum. For finding out the ratio between two words, I am using a python package fuzzywuzzy.

Examples :-

>>> from fuzzywuzzy import fuzz
>>> combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
>>> fuzz.ratio('goldman LLC', 'oldman LLC')
95
>>> fuzz.ratio('rakesh', 'zakesh')
83
>>> fuzz.ratio('bikash', 'zikash')
83
>>> 

My end goal:

My end goal is to group the words such that Levenshtein distance between them is more than 80 percent?

My list should look something like this :-

person_name = ['bikash', 'zikash', 'rakesh', 'zakesh', 'goldman LLC', 'oldman LLC'] because the distance between `bikash` and `zikash` is very high so they should be together.

Code:

I am trying to achieve this by sorting but key function should be fuzz.ratio. Well below code is not working, but I am approaching the problem in this angle.

from fuzzywuzzy import fuzz
combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.sort(key=lambda x, y: fuzz.ratio(x, y))
print combined_list

Could anyone help me to combine the words so that Levenshtein distance between them is more than 80 percent?

python
  • 4,403
  • 13
  • 56
  • 103
  • "combine", "similar", "very high" and "something like this" are all not well defined. You have to clarify your specifications. – timgeb Feb 03 '16 at 08:17
  • I have edited my question. Group all words such that the `Levenshtein distance` between them is more than `80%` – python Feb 03 '16 at 08:19
  • And if that's not possible for a given list of words, what should then happen? – timgeb Feb 03 '16 at 08:19
  • Then just do simple sorting – python Feb 03 '16 at 08:21
  • 4
    I'm confused. I see an optimization problem, i.e. "split this set of words into clusters with maximum internal distance", yet you continue to talk about "sorting", which is "putting the list into order". I'm not sure how those two mix. Do you only want the distance between two *successive* strings to be maximum? – dhke Feb 03 '16 at 08:23
  • 2
    In python 2 you can pass a comparison function `cmp` to `sort` (and related functions). That has been removed from Python 3, but there _is_ a workaround: see [cmp_to_key](https://docs.python.org/3/library/functools.html#functools.cmp_to_key). – PM 2Ring Feb 03 '16 at 08:26
  • I am trying to group all words such that the `Levenshtein distance` between them is more than 80 percent? I was initially thinking about sorting, but if you could suggest any other approach I can try that – python Feb 03 '16 at 08:26
  • 1
    As for the clustering problem (if it is one): Sounds like [maximum clique](https://en.wikipedia.org/wiki/Clique_problem#Finding_maximum_cliques_in_arbitrary_graphs). The nodes are the words and you have an edge between words, if the edit distance is >= 80% between those words. The rest is NP-complete, take an approximation algorithm of your choosing, [networkx](https://networkx.github.io/documentation/latest/reference/algorithms.clique.html) implements some. – dhke Feb 03 '16 at 08:34
  • I have deleted my previous comment! I don't think `cmp_to_key` is working for me. Could anyone suggest some approaches :/ :/ – python Feb 03 '16 at 08:38
  • @dhke Can we use the `group by` function in python to achieve the result ? – python Feb 03 '16 at 08:51
  • 1
    @python For *group by*, you need a function that determines the group association of a word based on the word itself. Problem here is: If you can add a word to a group depends on *all* the elements that are already in the group. It's a combinatorial optimization problem and as far as I can tell, maximum clique fits the bill. – dhke Feb 03 '16 at 09:17

1 Answers1

13

This groups the names

from fuzzywuzzy import fuzz

combined_list = ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']
combined_list.append('bakesh')
print('input names:', combined_list)

grs = list() # groups of names with distance > 80
for name in combined_list:
    for g in grs:
        if all(fuzz.ratio(name, w) > 80 for w in g):
            g.append(name)
            break
    else:
        grs.append([name, ])

print('output groups:', grs)
outlist = [el for g in grs for el in g]
print('output list:', outlist)

producing

input names: ['rakesh', 'zakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC', 'bakesh']
output groups: [['rakesh', 'zakesh', 'bakesh'], ['bikash', 'zikash'], ['goldman LLC', 'oldman LLC']]
output list: ['rakesh', 'zakesh', 'bakesh', 'bikash', 'zikash', 'goldman LLC', 'oldman LLC']

As you can see, the names are grouped correctly, but the order may not be the one you desire.

Pynchia
  • 10,996
  • 5
  • 34
  • 43
  • 1
    Isn't this only one (greedy) approach and the resulting groups may not be optimal? – dhke Feb 03 '16 at 12:05
  • @dhke yes of course. I never said it's optimal nor the only one. Please do provide a better/alternative solution. – Pynchia Feb 03 '16 at 12:19
  • I just wanted to mention it, but I'll try to give an optimal (maximum groups) when I get around. Nonetheless, your solution is of course correct and returns valid clusters. – dhke Feb 03 '16 at 12:37
  • 1
    This is essentially clustering with a single cluster, and incrementally adding elements. Note that it's order-dependent; as we add more elements to the cluster (and they become spread out), it becomes harder to satisfy the 'add' condtions `all(fuzz.ratio(name, w) > 80 for w in g)`. So if you shuffled `combined_list` before running, you could get different results. – smci Mar 12 '21 at 23:28