Find most common substring in a list of strings?

Question

I have a Python list of string names where I would like to remove a common substring from all of the names.

And after reading this similar answer I could almost achieve the desired result using SequenceMatcher.

But only when all items have a common substring:

From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges

common substring = "myKey_"

To List:
string 1 = apples
string 2 = appleses
string 3 = oranges

However I have a slightly noisy list that contains a few scattered items that don't fit the same naming convention.

I would like to remove the "most common" substring from the majority:

From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges
string 4 = foo
string 5 = myKey_Banannas

common substring = ""

To List:
string 1 = apples
string 2 = appleses
string 3 = oranges
string 4 = foo
string 5 = Banannas

I need a way to match the "myKey_" substring so I can remove it from all names.

But when I use the SequenceMatcher the item "foo" causes the "longest match" to be equal to blank "".

I think the only way to solve this is to find the "most common substring". But how could that be accomplished?

Basic example code:

from difflib import SequenceMatcher

names = ["myKey_apples",
"myKey_appleses",
"myKey_oranges",
#"foo",
"myKey_Banannas"]

string2 = names[0]
for i in range(1, len(names)):
    string1 = string2
    string2 = names[i]
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))

print(string1[match.a: match.a + match.size]) # -> myKey_

I think that you want to not actually remove the most common substring. It will usually be only one letter. For example, all of your strings contain the letter `s` (e.g. `apples` ends in `s`). There are two numbers of interest: length of the sub-string and percentage of the search space containing the sub-string. As you increase one, you decrease the other. — Toothpick Anemone, Oct 28 '19 at 05:54
I think this would help: https://en.wikipedia.org/wiki/Longest_common_substring_problem — Jethro Cao, Oct 28 '19 at 05:56

score 12 · Accepted Answer · answered Oct 28 '19 at 05:56

Given names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]

An O(n^2) solution I can think of is to find all possible substrings and storing them in a dictionary with the number of times they occur :

substring_counts={}

for i in range(0, len(names)):
    for j in range(i+1,len(names)):
        string1 = names[i]
        string2 = names[j]
        match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
        matching_substring=string1[match.a:match.a+match.size]
        if(matching_substring not in substring_counts):
            substring_counts[matching_substring]=1
        else:
            substring_counts[matching_substring]+=1

print(substring_counts) #{'myKey_': 5, 'myKey_apples': 1, 'o': 1, '': 3}

And then picking the maximum occurring substring

import operator
max_occurring_substring=max(substring_counts.iteritems(), key=operator.itemgetter(1))[0]
print(max_occurring_substring) #myKey_

Now that's heavenly stuff! Even worked with this crazy list: `["sti_myKey_123","sti_myKey_233","stimm_myKey_676","sti_myKey_879","foo","sti_myKey_2345","sti_myKey_test3","ti_myKey_123"]` Result="sti_myKey_". Many thanks :) — Logic1, Oct 28 '19 at 07:10
Others may also want to look at SequenceMatcher.get_matching_blocks https://www.kite.com/python/docs/difflib.SequenceMatcher.get_matching_blocks — Under-qualified NASA Intern, Feb 24 '22 at 21:00

score 1 · Answer 2 · answered Oct 28 '19 at 05:54

Here's a overly verbose solution to your problem:

def find_matching_key(list_in, max_key_only = True):
  """
  returns the longest matching key in the list * with the highest frequency
  """
  keys = {}
  curr_key = ''

  # If n does not exceed max_n, don't bother adding
  max_n = 0

  for word in list(set(list_in)): #get unique values to speed up
    for i in range(len(word)):
      # Look up the whole word, then one less letter, sequentially
      curr_key = word[0:len(word)-i]
      # if not in, count occurance
      if curr_key not in keys.keys() and curr_key!='':
        n = 0
        for word2 in list_in:
          if curr_key in word2:
            n+=1
        # if large n, Add to dictionary
        if n > max_n:
          max_n = n
          keys[curr_key] = n
    # Finish the word
  # Finish for loop  
  if max_key_only:
    return max(keys, key=keys.get)
  else:
    return keys    

# Create your "from list"
From_List = [
             "myKey_apples",
             "myKey_appleses",
             "myKey_oranges",
             "foo",
             "myKey_Banannas"
]

# Use the function
key = find_matching_key(From_List, True)

# Iterate over your list, replacing values
new_From_List = [x.replace(key,'') for x in From_List]

print(new_From_List)
['apples', 'appleses', 'oranges', 'foo', 'Banannas']

Needless to say, this solution would look a lot neater with recursion. Thought I'd sketch out a rough dynamic programming solution for you though.

RoyM · Answer 3 · 2019-10-28T06:27:05.523

I would first find the starting letter with the most occurrences. Then I would take each word having that starting letter, and take while all these words have matching letters. Then in the end I would remove the prefix that was found from each starting word:

from collections import Counter
from itertools import takewhile

strings = ["myKey_apples", "myKey_appleses", "myKey_oranges", "berries"]

def remove_mc_prefix(words):
    cnt = Counter()
    for word in words:
        cnt[word[0]] += 1
    first_letter = list(cnt)[0]

    filter_list = [word for word in words if word[0] == first_letter]
    filter_list.sort(key = lambda s: len(s)) # To avoid iob

    prefix = ""
    length = len(filter_list[0])
    for i in range(length):
        test = filter_list[0][i]
        if all([word[i] == test for word in filter_list]):
            prefix += test
        else: break
    return [word[len(prefix):] if word.startswith(prefix) else word for word in words]

print(remove_mc_prefix(strings))

Out: ['apples', 'appleses', 'oranges', 'berries']

P_M · Answer 4 · 2022-12-02T01:09:23.993

To find the most-common-substring from list of python-string

I already tested on python-3.10.5 I hope it will work for you. I have the same use case but a different kind of task, I just need to find one common-pattern-string from a list of more than 100s files. To use as a regular-expression.

Your Basic example code is not working in my case. because 1st checking with 2nd, 2nd with 3rd, 3rd with 4th and so on. So, I change it to the most common substring and will check with each one.

The downside of this code is that if something is not common with the most common substring, the final most common substring will be an empty one. But in my case, it is working.


from difflib import SequenceMatcher
for i in range(1, len(names)):
    if i==1:
        string1, string2 = names[0], names[i]
    else:
        string1, string2 = most_common_substring, names[i]
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
    most_common_substring = string1[match.a: match.a + match.size]

print(f"most_common_substring : {most_common_substring}")

python python-3 python-difflib

Find most common substring in a list of strings?

4 Answers4

Linked