
I want to find out what a few strings have in common. For example, I have 4 strings:

HELLO3456
helf04g
hell0r
h31l0

I want to find what is similar in these strings. In this case, for example, I would like it to tell me something like:

h is always at the start

That example is pretty simple and I can figure that out in my head but with something like:

61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj
pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM
fW9K4luEx65RscfUiPDakiqp15jiK5f6
17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y
Jvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd
n7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b

it's not that easy. I have seen and tried:

to name a few, but none of them is what I'm looking for. They give a value for how similar the strings are, whereas I need to know what is similar in them.

I want to know if this is even possible and if so, how I can do it. Thank you in advance.

Etile
  • I think you need a precise definition of what "similarity" means to you. Then you'll have a better idea of what to code. For instance, in your 2nd example, is "all strings are of the same length" acceptable as a similarity? – Joseph Budin Dec 04 '20 at 13:53
  • Do you want a character-by-character comparison, as in your "h is always at the start" (another example would be "l is always in the third place")? Or do you want matches of one or more characters at any location in the string? – halcyoona Dec 04 '20 at 13:57
  • @JosephBudin Yeah, I should probably do that. I meant 'similarity' as the content of the strings for example: 1. same characters in the same place 2. only lowercase/uppercase/special characters or numbers in one place, etc. – Etile Dec 04 '20 at 13:59

2 Answers


Minimal Solution

You are on the right track with the difflib library. I picked the first two strings from your second example to create a minimal solution.

from difflib import SequenceMatcher


a = "61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj"
b = "pHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM"

# None means no characters are treated as junk
matcher = SequenceMatcher(None, a, b)

# Overall similarity as a float between 0.0 and 1.0
print(matcher.ratio())
matches = matcher.get_matching_blocks()
print(matches)

for match in matches:
    idx_a = match.a
    idx_b = match.b

    # Skip the final zero-size block, which starts at the end of both strings
    if not (idx_a == len(a) or idx_b == len(b)):
        print(30*'-' + 'Found Match' + 30*'-')
        print('found at idx {} of str "a" and at idx {} of str "b" the value {}'.format(idx_a, idx_b, a[idx_a]))

Output:

0.0625
[Match(a=2, b=18, size=1), Match(a=5, b=29, size=1), Match(a=32, b=32, size=0)]
------------------------------Found Match------------------------------
found at idx 2 of str "a" and at idx 18 of str "b" the value T
------------------------------Found Match------------------------------
found at idx 5 of str "a" and at idx 29 of str "b" the value 2

Explanation

I used ratio() only to check whether any similarity exists. The function get_matching_blocks() returns a list of all matching blocks between the two sequences. My minimal solution doesn't require a match to be at the same position in both strings, but that is an easy fix: compare the two indices. Note that even when ratio() returns 0.0, the matcher does not return an empty list. The list always ends with a zero-size match that marks the end of both sequences. I worked around this by checking the match indices against the lengths of the sequences. Another solution is to use only matches with a size > 0, as shown below:

if match.size > 0:
   ...

My example also doesn't handle matches with size > 1: it prints only the first character of each block. I think you will figure out how to handle this ;)
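
The two gaps mentioned above (matches at the same position, and blocks longer than one character) could be closed roughly like this. It is only a sketch: common_blocks is a hypothetical helper name, not part of difflib, and the two input strings are taken from the first example in the question.

```python
from difflib import SequenceMatcher


def common_blocks(a, b):
    """Return (index_in_a, index_in_b, text) for every non-empty matching block."""
    matcher = SequenceMatcher(None, a, b)
    return [
        # Slice with the block size so multi-character blocks are kept whole
        (m.a, m.b, a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size > 0  # drop the zero-size block that always ends the list
    ]


for idx_a, idx_b, text in common_blocks("hell0r", "helf04g"):
    # Equal indices mean the block sits at the same position in both strings
    where = "same position" if idx_a == idx_b else "different positions"
    print(f"{text!r}: idx {idx_a} in a, idx {idx_b} in b ({where})")
```

SequenceMatcher compares exactly two sequences, so for more than two strings this would have to be run pairwise and the results combined.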

MaKaNu

I think this should be your desired solution. I have added "a" at the start of every string because otherwise there is no similarity in the strings you mentioned.

lst = ["A61TvA2dNwxNxmWziZxKzR5aO9tFD00Nj","apHHlgpFt8Ka3Stb5UlTxcaEwciOeF2QM","afW9K4luEx65RscfUiPDakiqp15jiK5f6","a17xz7MYEBoXLPoi8RdqbgkPwTV2T2H0y", "aJvt0B5uZIDPJ5pbCqMo12CqD7pdnMSEd","an7voYT0TVVzZGVSLaQNRnnkkWgVqxA3b"]

# Lowercase everything so the comparison is case-insensitive
lst = [s.lower() for s in lst]

# zip(*lst) yields one tuple per position, holding the character
# that each string has there (it stops at the shortest string)
for i, chars in enumerate(zip(*lst)):
    # A single distinct character means every string agrees at position i
    if len(set(chars)) == 1:
        print(chars[0] + " is always at position " + str(i))
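
The asker's comment also mentions a second kind of similarity: a position that always holds the same kind of character (only lowercase, only uppercase, only digits, and so on). A possible extension of the loop above is sketched here. The names char_class and describe_positions are made up for this illustration, the sample strings are hypothetical, and "other" is an assumed catch-all for anything that is not a letter or a digit.

```python
def char_class(ch):
    """Classify a single character into a coarse category."""
    if ch.isdigit():
        return "digit"
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    return "other"  # assumed catch-all: punctuation, whitespace, ...


def describe_positions(strings):
    """List what all strings share at each position (case-sensitive)."""
    findings = []
    # zip stops at the shortest string, so only common positions are checked
    for i, chars in enumerate(zip(*strings)):
        if len(set(chars)) == 1:
            findings.append(f"{chars[0]!r} is always at position {i}")
            continue
        classes = {char_class(c) for c in chars}
        if len(classes) == 1:
            findings.append(f"position {i} is always {classes.pop()}")
    return findings


# Hypothetical sample data, not taken from the question
for line in describe_positions(["ab1X", "ac2Y", "ad3Z"]):
    print(line)
```

Unlike the loop above, this sketch does not lowercase the input, since that would hide the uppercase/lowercase distinction.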
halcyoona