0

Consider:

string = 'pizza'
matchings = ['pizzas', 'potato chips', 'cheesy lime', 'pretzels', 'pork']

I am trying to find a good way to find the best match in the list. which I am calculating with:

matchings_indices = {matching:sum([s == m for s,sdx in enumerate(string)\
                                 for m, mdx in enumerate(matching) if sdx<=mdx])/len(string) 
                     for matching in matchings}
matchings_indices

Which results in:

{'pizzas': 1.0,
 'potato chips': 0.6,
 'cheesy lime': 0.2,
 'pretzels': 0.6,
 'pork': 0.4}

Simple but good enough! I can pluck out the maximum value and that will be the match (I only need one matching value, calculated scores for clarity). But it really struggles when strings very similar appear in the list:

string = 'pizza'
matchings = ['pizzas', 'pizza fries', 'cheesy lime', 'pizzo', 'pizza']

Now my output becomes:

{'pizzas': 1.0,
 'pizza fries': 1.0,
 'cheesy lime': 0.2,
 'pizzo': 1.0,
 'pizza': 1.0}

Of course here pizza should have maximum index. I tried sorting them as well like:

matchings_indices = {matching:sum([s == m for s,sdx in enumerate(sorted(string))\
                                 for moose in matching.split() 
                                 for m, mdx in enumerate(sorted(moose)) if sdx==mdx])/len(string) 
                     for matching in matchings}

But in that case this is the output for first case: (Still good enough for very dissimilar strings)

{'pizzas': 0.8,
 'potato chips': 0.0,
 'cheesy lime': 0.0,
 'pretzels': 0.0,
 'pork': 0.2}

and here for second:

{'pizzas': 0.8,
 'pizza fries': 1.0,
 'cheesy lime': 0.2,
 'pizzo': 0.6,
 'pizza': 1.0}

Which is better but still. pizzas is a better match than pizza fries and should be scored higher.

So any help bettering the situation will be great!

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Hamza
  • 5,373
  • 3
  • 28
  • 43

1 Answers1

1

You could have a look at using the edit distance/levenshtein distance. From the Wikipedia page:

the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

I found this answer which calculates the distance, and then you could subtract this distance from 1 to make your maximum score the best:

# from https://stackoverflow.com/a/32558749/6386471
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

string = 'pizza'
matchings = ['pizzas', 'pizza fries', 'cheesy lime', 'pizzo', 'pizza']

scores = {}

for m in matchings:
    scores[m] = 1 - levenshteinDistance(string,m)

scores

>>> {'pizzas': 0, 'pizza fries': -5, 'cheesy lime': -10, 'pizzo': 0, 'pizza': 1}

import operator
max(scores.items(), key=operator.itemgetter(1))[0]

>>> 'pizza'
user6386471
  • 1,203
  • 1
  • 8
  • 17
  • Although very slow for large data structures like dataframes with 100k+ rows, its actually the best way to determine similarity! – Hamza Nov 22 '20 at 00:56