Identifying common elements in a list of words

Question

I have a list of words in a column, and I need to find the elements they have in common. For example, the list contains words such as:

sinazz31 sinazz12 45sinazz sinazz_84

As you can see, the common element is “sinazz”. Is there a way to develop an algorithm in Python to identify such common elements? Words shorter than 4 characters can be ignored.

Potential duplicate: https://stackoverflow.com/questions/58585052/find-most-common-substring-in-a-list-of-strings — Liam, Mar 11 '22 at 07:05

Liam · Answer 1 · 2022-03-11T07:10:41.253

Have a look at this similar question: (Find most common substring in a list of strings?)

I added a condition so that a match is not counted if it is shorter than 4 characters.

from difflib import SequenceMatcher

substring_counts = {}
words = ['sinazz31', 'sinazz12', '45sinazz', 'sinazz_84']

# compare every pair of words once
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        string1 = words[i]
        string2 = words[j]
        # longest contiguous block shared by the two words
        match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
        matching_substring = string1[match.a:match.a + match.size]
        # count the match only if it is at least 4 characters long
        if len(matching_substring) > 3:
            substring_counts[matching_substring] = substring_counts.get(matching_substring, 0) + 1

print(substring_counts)
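
If a single answer is wanted rather than a dictionary of counts, the same pairwise idea could be wrapped in a function. This is only a sketch: the function name `most_common_substring` and the `min_length` parameter are made up for illustration, and `autojunk=False` is an optional choice so that `SequenceMatcher` does not apply its popularity heuristic to long inputs.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations


def most_common_substring(words, min_length=4):
    """Return the longest-match substring shared by the most word pairs, or None."""
    counts = Counter()
    for first, second in combinations(words, 2):
        matcher = SequenceMatcher(None, first, second, autojunk=False)
        match = matcher.find_longest_match(0, len(first), 0, len(second))
        substring = first[match.a:match.a + match.size]
        # ignore matches that are too short to be meaningful
        if len(substring) >= min_length:
            counts[substring] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(most_common_substring(['sinazz31', 'sinazz12', '45sinazz', 'sinazz_84']))
```

Note that this only looks at the single longest match per pair, so a shorter substring shared by more words can be missed.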

score 0 · Accepted Answer · answered Mar 11 '22 at 07:23

You could search for substrings that occur in more than one of the source strings. Take candidate substrings from the longest word, starting at its full length and working down to a minimum length:

string = 'sinazz31 sinazz12 45sinazz sinazz_84'
min_substring_length = 3

words = string.split()
longest_word = max(words, key=len)  # candidate substrings are taken from this word
matches = {}

# try every substring of the longest word, longest first
for sub_length in range(len(longest_word), min_substring_length - 1, -1):
    for x in range(len(longest_word) - sub_length + 1):  # + 1 so the last start position is included
        substring = longest_word[x:x + sub_length]  # create substring to check
        check = sum(1 for word in words if substring in word)  # number of words containing substring
        if check > 1:
            matches[substring] = check

# results
if matches:
    match_list = sorted(matches, key=matches.get, reverse=True)  # matches ordered by frequency

    if matches[match_list[0]] == len(words):  # prints substring if it matches all words
        print('best match for all words:', match_list[0])
    print('best to worst:', match_list)
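
If only a substring shared by every word is of interest, one possible variation takes candidates from the shortest word instead, since nothing longer can appear in all of them, and stops at the first hit. This is a sketch under assumptions: the name `longest_common_substring` and the `min_length` default of 4 are illustrative, and dropping words shorter than `min_length` is my reading of the requirement in the question.

```python
def longest_common_substring(words, min_length=4):
    """Return the longest substring found in every word, or None if none is long enough."""
    # words shorter than min_length are ignored, as the question allows
    words = [word for word in words if len(word) >= min_length]
    if not words:
        return None
    shortest = min(words, key=len)
    # try the longest candidates first so the first hit is the answer
    for length in range(len(shortest), min_length - 1, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in word for word in words):
                return candidate
    return None


print(longest_common_substring('sinazz31 sinazz12 45sinazz sinazz_84'.split()))
```

Because it returns on the first match, this does less work than collecting every candidate, but it gives no ranking of partial matches.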
