I need an algorithm that measures how similar two strings of equal length are.

For example, we have one reference string of length 10 and several other strings of length 10 that should be measured against it.

The first has one matching part of 9 symbols (9 out of 10).

The second has two matching parts of 7 and 2 symbols.

The third has three matching parts of 4, 3 and 1 symbols.

The fourth has one matching part of 8 symbols.

The fifth has one matching part of 6 symbols.

I need an algorithm that ranks all these strings by their level of similarity. As I understand it, the longer a single matching part is, the more similar the strings are. But which is better: one part of 8 symbols, or two parts of 7 and 2 symbols? One part of 6 symbols, or three parts of 4, 3 and 1 symbols that add up to 8? Any advice?

P.S. I don't need an algorithm for comparing strings. I need an algorithm for scoring the difference: how to define similarity once I already have the common parts of two strings.

Initial string "i like apple"

  1. "apple i like" (apple i like)
  2. "i like appel" (i like app l e)
  3. "i like papel" (i like ap p l e)
  4. "i like pleap" (i like ap ple)
  5. "i like mango" (i like a)

It should be some formula that takes into account the total length of the string, the lengths of the parts that "cover" the initial string, and maybe some additional parameter (fewer parts might mean more similar, but I am not sure).
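To make the idea concrete, here is a minimal sketch of such a formula. The function name `coverage_score` and the `fragment_penalty` parameter are made up for illustration, and the default penalty of 0.5 is an arbitrary assumption, not a recommended value.

```python
def coverage_score(part_lengths, total_length, fragment_penalty=0.5):
    """Share of the string covered by common parts, reduced for fragmentation.

    Hypothetical formula: every part beyond the first costs
    `fragment_penalty` symbols. The result is clamped to the range 0..1.
    """
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    covered = sum(part_lengths)
    extra_parts = max(len(part_lengths) - 1, 0)
    raw = (covered - fragment_penalty * extra_parts) / total_length
    return max(0.0, min(1.0, raw))


# The five cases from above, all against a string of length 10.
for parts in ([9], [7, 2], [4, 3, 1], [8], [6]):
    print(parts, round(coverage_score(parts, 10), 2))
```

With the default penalty, two parts of 7 and 2 score higher than one part of 8. Raising `fragment_penalty` to 2 reverses that order, so the penalty is exactly the open question: how much a split should cost.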

  • Does this answer your question? [Find the similarity metric between two strings](https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings) – mkrieger1 May 30 '22 at 13:06
  • This does not seem to be the typical "minimum edit distance", e.g. [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance). Please define in more detail what "similar" means in your case. Your last paragraph seems to suggest that you don't know this yourself. If you want advice on how to _define_ similarity, please show some actual sample strings. – tobias_k May 30 '22 at 13:08
  • Please provide enough code so others can better understand or reproduce the problem. – Community May 30 '22 at 13:32
  • @tobias_k tried – user3894579 May 30 '22 at 16:09
  • Interesting problem, but still rather under-specified. In particular, it's still not really clear if you seek advice how those different strings should be ranked in the first place, or if you just need help with the algorithm for finding those maximum common substrings. Also, the first batch of examples does not seem to match the second batch. – tobias_k May 30 '22 at 16:25
  • For the "how they should be ranked" part, maybe you can use the sum of squared length of matching substrings, e.g. `1+16+4+9` for `i like ap ple`; not sure how the order of substrings comes into play, though, i.e. whether `pleap` or `appel` should be more similar. – tobias_k May 30 '22 at 16:29
  • @tobias_k I need advice on how they should be ranked, ideally as a value from 0 to 1 in the end. Yes, the first batch does not match the second batch; I tried to create a readable example. It seems `ple ap` would come out as more similar than `app e l`. That is strange, but how can I guard against such cases? – user3894579 May 30 '22 at 16:53
  • @tobias_k if we use the sum of squared lengths, we can choose the best-matching string from a set of others, but we cannot measure similarity in general: we might just be choosing the best of several very dissimilar strings. – user3894579 May 30 '22 at 18:02
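One way to turn the sum of squared lengths suggested in the comments into a value from 0 to 1 is to divide it by the square of the total string length, so that a complete match scores 1 and no match scores 0. The following is only a sketch under that assumption; the name `squared_run_score` is hypothetical, and the part lengths are taken as already known.

```python
def squared_run_score(part_lengths, total_length):
    """Sum of squared part lengths, scaled into the range 0..1."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    if sum(part_lengths) > total_length:
        raise ValueError("parts cannot cover more than the whole string")
    return sum(n * n for n in part_lengths) / (total_length ** 2)


# The five cases from the question, all against a string of length 10.
cases = {
    "one part of 9": [9],
    "parts of 7 and 2": [7, 2],
    "parts of 4, 3 and 1": [4, 3, 1],
    "one part of 8": [8],
    "one part of 6": [6],
}

# Most similar first.
ranking = sorted(
    cases,
    key=lambda name: squared_run_score(cases[name], 10),
    reverse=True,
)
for name in ranking:
    print(f"{name}: {squared_run_score(cases[name], 10):.2f}")
```

Under this scoring one part of 8 ranks above parts of 7 and 2, and one part of 6 ranks above parts of 4, 3 and 1. Because the score is absolute rather than relative, a set of very dissimilar strings would all score close to 0 instead of one of them simply being picked as "best".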

1 Answer

You need to compare every char of the first string with every char of the second string, with something like this, storing each string in a dictionary together with an integer that sums up the similarities.

stringToCompare = 'ABCDE'
String1 = 'ABCDE'
String2 = 'ABCDF'
String3 = 'ABCKJ'
String4 = 'ABLMN'

if __name__ == '__main__':
    # One record per candidate string, each with a running similarity counter.
    Dict1 = {'string': String1, 'similarity': 0}
    Dict2 = {'string': String2, 'similarity': 0}
    Dict3 = {'string': String3, 'similarity': 0}
    Dict4 = {'string': String4, 'similarity': 0}
    dictList = [Dict4, Dict3, Dict2, Dict1]

    for entry in dictList:  # 'entry' avoids shadowing the built-in dict
        for stringChar in entry['string']:
            for mainChar in stringToCompare:
                # Counts every pair of equal characters, regardless of position.
                if stringChar == mainChar:
                    entry['similarity'] += 1

    # Highest similarity first.
    SORTED = sorted(dictList, key=lambda d: d['similarity'], reverse=True)

    print(SORTED)

After this, SORTED holds the list of dictionaries sorted by the key 'similarity'.

I don't know how your strings are formatted; you will need to automate the creation of 'dictList' to avoid building it manually.
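Note that the loop above counts every pair of equal characters regardless of where they occur. Since the strings have equal size, a position-by-position comparison may be closer to what is wanted. This is a different technique from the loop above, shown only as a rough sketch; the helper names `matching_run_lengths` and `positional_similarity` are made up for illustration.

```python
from itertools import groupby


def matching_run_lengths(reference, candidate):
    """Lengths of consecutive runs where both strings agree at the same index."""
    if len(reference) != len(candidate):
        raise ValueError("strings must have equal length")
    flags = [a == b for a, b in zip(reference, candidate)]
    # groupby collapses neighbouring equal flags; keep only the matching runs.
    return [sum(1 for _ in group) for matched, group in groupby(flags) if matched]


def positional_similarity(reference, candidate):
    """Fraction of positions (0..1) at which the two strings agree."""
    if not reference and not candidate:
        return 1.0  # assumption: two empty strings count as identical
    return sum(matching_run_lengths(reference, candidate)) / len(reference)


if __name__ == '__main__':
    reference = 'ABCDE'
    for candidate in ['ABCDE', 'ABCDF', 'ABCKJ', 'ABLMN', 'AXCDE']:
        print(candidate,
              matching_run_lengths(reference, candidate),
              positional_similarity(reference, candidate))
```

The run lengths returned by `matching_run_lengths` are the "parts" from the question, so they can be fed into whatever scoring formula is chosen. This sketch only detects parts that stay at the same index; parts that moved, as in "apple i like", would need a different matching step.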

Hope it helps