113

I'd like to compare 2 strings and keep the matched, splitting off where the comparison fails.

So if I have 2 strings:

string1 = "apples"
string2 = "appleses"

answer = "apples"

Another example, as the string could have more than one word:

string1 = "apple pie available"
string2 = "apple pies"

answer = "apple pie"

I'm sure there is a simple Python way of doing this but I can't work it out, any help and explanation appreciated.

funnydman
  • 9,083
  • 4
  • 40
  • 55
NorthSide
  • 1,459
  • 2
  • 12
  • 16

20 Answers20

193

For completeness, difflib in the standard-library provides loads of sequence-comparison utilities. For instance find_longest_match which finds the longest common substring when used on strings. Example use:

from difflib import SequenceMatcher

string1 = "apple pie available"
string2 = "come have some apple pies"

match = SequenceMatcher(None, string1, string2).find_longest_match()

print(match)  # -> Match(a=0, b=15, size=9)
print(string1[match.a:match.a + match.size])  # -> apple pie
print(string2[match.b:match.b + match.size])  # -> apple pie

If you're using a version older than 3.9, you'need to call find_longest_match() with the following arguments:

SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
Francisco
  • 10,918
  • 6
  • 34
  • 45
RickardSjogren
  • 4,070
  • 3
  • 17
  • 26
  • 11
    Heads up to those using this on longer strings, you might want to set the kwarg "autojunk" to False when creating the instance of SequenceMatcher. – MLP Aug 20 '18 at 03:57
  • 2
    I'll note that there are outstanding bugs in difflib that should prevent its use in real-world scenarios. For example, it seems that the well known 'heuristic' interferes with the completeness of methods such as 'get_matching_blocks'. – W4t3randWind Oct 27 '18 at 16:18
  • 11
    **Warning: This answer does not find the longest common substring!** Despite its name (and the method's documentation), `find_longest_match()` does not do what its name implies. The class documentation for `SequenceMatcher` does hint at this, however, saying: `This does not yield minimal edit sequences`. For example, in some cases, `find_longest_match()` will claim there are *no* matches in two strings of length 1000, even though there are matching substrings of length > 500. – Aleksi Torhamo Mar 19 '19 at 14:00
  • 8
    man, what turkey wrote that API. Forcing you to put the lengths of the strings in everytime instead of just assume its the ful strings, and the first argument to SequenceMatcher is nearly always going to be None :@ – CpILL Oct 06 '21 at 04:15
  • 1
    @CpILL default arguments were added on Python 3.9. – Francisco Mar 29 '22 at 09:04
52

One might also consider os.path.commonprefix that works on characters and thus can be used for any strings.

import os
common = os.path.commonprefix(['apple pie available', 'apple pies'])
assert common == 'apple pie'

As the function name indicates, this only considers the common prefix of two strings.

jonas
  • 1,074
  • 11
  • 11
  • 4
    It doesn't work, when compare string like ['an apple pie available', 'apple pies']. – GoTop Feb 10 '19 at 03:04
  • 2
    Clarified answer, it should be clear what this solution does now. The question is a bit vague in that regard. The title suggests "any substring", description and examples indicate "common prefix". – jonas Jun 29 '20 at 08:48
  • @famzah You linked to the documentation of `os.commonpath` this is not the same as the `os.commonprefix` that is used in the answer. But true, there could be some limitations, just the documentation does not mention any. – jonas Nov 08 '20 at 13:00
41
def common_start(sa, sb):
    """ returns the longest common substring from the beginning of sa and sb """
    def _iter():
        for a, b in zip(sa, sb):
            if a == b:
                yield a
            else:
                return

    return ''.join(_iter())
>>> common_start("apple pie available", "apple pies")
'apple pie'

Or a slightly stranger way:

def stop_iter():
    """An easy way to break out of a generator"""
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a if a == b else stop_iter() for a, b in zip(sa, sb))

Which might be more readable as

def terminating(cond):
    """An easy way to break out of a generator"""
    if cond:
        return True
    raise StopIteration

def common_start(sa, sb):
    return ''.join(a for a, b in zip(sa, sb) if terminating(a == b))
Rahul K P
  • 15,740
  • 4
  • 35
  • 52
Eric
  • 95,302
  • 53
  • 242
  • 374
  • 11
    This solution, as of now, isn't complete. It only compares both strings from the zeroth position. For instance: >>> common_start("XXXXXapple pie available", "apple pies") returns an empty string. – Nitin Nain Sep 07 '14 at 19:36
  • 4
    @NitinNain: That was never clarified in the original question. But yes, this solution only finds the common _start_ of strings – Eric Sep 07 '14 at 21:56
  • 1
    will this work once [PEP479](http://legacy.python.org/dev/peps/pep-0479/) is in effect? – Janus Troelsen Aug 19 '15 at 08:36
  • 1
    No - from [that document](http://legacy.python.org/dev/peps/pep-0479/#consequences-for-existing-code): _"There are also examples of generator expressions floating around that rely on a StopIteration raised by the expression, the target **or the predicate** (rather than by the __next__() call implied in the for loop proper)._" – Eric Aug 19 '15 at 20:11
  • 1
    @Eric still, from the [Python 3.6 release notes](https://docs.python.org/3/whatsnew/3.6.html#deprecated-python-behavior), `Raising the StopIteration exception inside a generator will now generate a DeprecationWarning`. If you run your code with `Python3 -W default::DeprecationWarning`, the last two examples both raise `DeprecationWarning`s – jpyams Dec 20 '17 at 16:10
  • @Eric, thank you for *coding according to spec* -- this was exactly what I needed. :-) – KlaymenDK Feb 02 '18 at 11:12
13

Its called Longest Common Substring problem. Here I present a simple, easy to understand but inefficient solution. It will take a long time to produce correct output for large strings, as the complexity of this algorithm is O(N^2).

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        match = ""
        for j in range(len2):
            if (i + j < len1 and string1[i + j] == string2[j]):
                match += string2[j]
            else:
                if (len(match) > len(answer)): answer = match
                match = ""
    return answer

print(longestSubstringFinder("apple pie available", "apple pies"))
print(longestSubstringFinder("apples", "appleses"))
print(longestSubstringFinder("bapples", "cappleses"))

Output

apple pie
apples
apples
Francisco
  • 10,918
  • 6
  • 34
  • 45
thefourtheye
  • 233,700
  • 52
  • 457
  • 497
  • 9
    This algorithm is incorrect with given some inputs (e.g. "apple pie...", "apple pie") but works if you switch parameter position. I think there's something wrong with the if statement when you compare `i+j < len1` – REALFREE Jul 09 '14 at 06:47
  • this works for the longest prefix and breaks on suffixes. E.g. `x = "cov_basic_as_cov_x_gt_y_rna_genes_w1000000" y = "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000"` – Dima Lituiev Feb 22 '16 at 18:51
  • If it's just about finding the longest common sub-string would be an overkill. All you need to find if there is a common sub-string. A suffix tree based representation will be an efficient way to search if there is a common in the given 2 strings. search will be O(m) complexity where m is the length of the shorter string. Additional storage required for Tree representation. We could optimize the storage by storing just the indexes in the tree and not actual chars. – lalatnayak Jul 02 '17 at 19:37
  • 15
    its totaly wrong. try string1="2193588" , string2="21943588" – Nozar Safari Feb 14 '18 at 08:28
  • 5
    this needs to get down votes to get removed ...this is a wrong answer... – grepit Dec 06 '18 at 06:10
  • 3
    This doesn't work because it does not consider scenario where you will need to do a "re-matching" for the second string. For instance, in "acdaf" vs "acdacdaf", when starting from "a" of the first string it will match all the way till the "acda" part of the second string, then it will break at c. Then no matter what you can no longer pick up acdaf. – Tamaki Sakura Jan 24 '19 at 22:34
  • 2
    this is a wrong answer, won't even work on equal inputs, which should output one of the inputs – mounaim Mar 01 '21 at 23:23
  • It is a wrong solution. Don't know why it is accepted. Failed for s1 = 'leetcode' and s2='etco' – Saket Suraj Mar 03 '22 at 14:28
  • This algorithm has a bug longestSubstringFinder('149097_(08-03)', '149097') = '9' – Marat Zakirov Mar 23 '23 at 13:08
11

Fix bugs with the first's answer:

def longestSubstringFinder(string1, string2):
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp = 0
            match = ''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp += 1
            if len(match) > len(answer):
                answer = match
    return answer

print(longestSubstringFinder("dd apple pie available", "apple pies"))
print(longestSubstringFinder("cov_basic_as_cov_x_gt_y_rna_genes_w1000000", "cov_rna15pcs_as_cov_x_gt_y_rna_genes_w1000000")
print(longestSubstringFinder("bapples", "cappleses"))
print(longestSubstringFinder("apples", "apples"))
Francisco
  • 10,918
  • 6
  • 34
  • 45
user7733798
  • 111
  • 1
  • 2
6

The same as Evo's, but with arbitrary number of strings to compare:

def common_start(*strings):
    """ Returns the longest common substring
        from the beginning of the `strings`
    """
    def _iter():
        for z in zip(*strings):
            if z.count(z[0]) == len(z):  # check all elements in `z` are the same
                yield z[0]
            else:
                return

    return ''.join(_iter())
Community
  • 1
  • 1
SergeyR
  • 468
  • 5
  • 10
5

The fastest way I've found is to use suffix_trees package:

from suffix_trees import STree

a = ["xxxabcxxx", "adsaabc"]
st = STree.STree(a)
print(st.lcs()) # "abc"
Andrey
  • 5,932
  • 3
  • 17
  • 35
3

This script requests you the minimum common substring length and gives all common substrings in two strings. Also, it eliminates shorter substrings that longer substrings include already.

def common_substrings(str1,str2):
    len1,len2=len(str1),len(str2)

    if len1 > len2:
        str1,str2=str2,str1 
        len1,len2=len2,len1
    #short string=str1 and long string=str2

    min_com = int(input('Please enter the minumum common substring length:'))
    
    cs_array=[]
    for i in range(len1,min_com-1,-1):
        for k in range(len1-i+1):
            if (str1[k:i+k] in str2):
                flag=1
                for m in range(len(cs_array)):
                    if str1[k:i+k] in cs_array[m]:
                    #print(str1[k:i+k])
                        flag=0
                        break
                if flag==1:
                    cs_array.append(str1[k:i+k])
    if len(cs_array):
        print(cs_array)
    else:
        print('There is no any common substring according to the parametres given')

common_substrings('ciguliuana','ciguana')
common_substrings('apples','appleses')
common_substrings('apple pie available','apple pies')
serko
  • 31
  • 3
2

Try:

import itertools as it
''.join(el[0] for el in it.takewhile(lambda t: t[0] == t[1], zip(string1, string2)))

It does the comparison from the beginning of both strings.

Birei
  • 35,723
  • 2
  • 77
  • 82
  • I'm now wanting python to make `it.takewhile` a language feature: `a for a, b in zip(string1, string2) while a == b` – Eric Sep 10 '13 at 11:23
  • `''.join(el[0] for el in itertools.takewhile(lambda t: t[0] == t[1], zip("ahello", "hello")))` returns `""`, which appears to be incorrect. The correct result would be `"hello"`. – Anderson Green Jan 26 '15 at 05:43
  • @AndersonGreen: You are right, it doesn't answer exactly the question, althought his examples only took into account the starting point at first char and I pointed out it in my answer too. – Birei Jan 26 '15 at 12:17
2
def matchingString(x,y):
    match=''
    for i in range(0,len(x)):
        for j in range(0,len(y)):
            k=1
            # now applying while condition untill we find a substring match and length of substring is less than length of x and y
            while (i+k <= len(x) and j+k <= len(y) and x[i:i+k]==y[j:j+k]):
                if len(match) <= len(x[i:i+k]):
                   match = x[i:i+k]
                k=k+1
    return match  

print matchingString('apple','ale') #le
print matchingString('apple pie available','apple pies') #apple pie     
radhikesh93
  • 870
  • 9
  • 25
1

A Trie data structure would work the best, better than DP. Here is the code.

class TrieNode:
    def __init__(self):
        self.child = [None]*26
        self.endWord = False

class Trie:

    def __init__(self):
        self.root = self.getNewNode()

    def getNewNode(self):
        return TrieNode()

    def insert(self,value):
        root = self.root


        for i,character in enumerate(value):
            index = ord(character) - ord('a')
            if not root.child[index]:
                root.child[index] = self.getNewNode()
            root = root.child[index]

        root.endWord = True


    def search(self,value):
        root = self.root

        for i,character in enumerate(value):
            index = ord(character) - ord('a')
            if not root.child[index]:
                return False
            root = root.child[index]
        return root.endWord

def main(): 

    # Input keys (use only 'a' through 'z' and lower case) 
    keys = ["the","anaswe"] 
    output = ["Not present in trie", 
            "Present in trie"] 

    # Trie object 
    t = Trie() 

    # Construct trie 
    for key in keys: 
        t.insert(key) 

    # Search for different keys 
    print("{} ---- {}".format("the",output[t.search("the")])) 
    print("{} ---- {}".format("these",output[t.search("these")])) 
    print("{} ---- {}".format("their",output[t.search("their")])) 
    print("{} ---- {}".format("thaw",output[t.search("thaw")])) 

if __name__ == '__main__': 
    main() 

Let me know in case of doubts.

David García Bodego
  • 1,058
  • 3
  • 13
  • 21
1

In case we have a list of words that we need to find all common substrings I check some of the codes above and the best was https://stackoverflow.com/a/42882629/8520109 but it has some bugs for example 'histhome' and 'homehist'. In this case, we should have 'hist' and 'home' as a result. Furthermore, it differs if the order of arguments is changed. So I change the code to find every block of substring and it results a set of common substrings:

main = input().split(" ")    #a string of words separated by space
def longestSubstringFinder(string1, string2):
    '''Find the longest matching word'''
    answer = ""
    len1, len2 = len(string1), len(string2)
    for i in range(len1):
        for j in range(len2):
            lcs_temp=0
            match=''
            while ((i+lcs_temp < len1) and (j+lcs_temp<len2) and string1[i+lcs_temp] == string2[j+lcs_temp]):
                match += string2[j+lcs_temp]
                lcs_temp+=1         
            if (len(match) > len(answer)):
                answer = match              
    return answer

def listCheck(main):
    '''control the input for finding substring in a list of words'''
    string1 = main[0]
    result = []
    for i in range(1, len(main)):
        string2 = main[i]
        res1 = longestSubstringFinder(string1, string2)
        res2 = longestSubstringFinder(string2, string1)
        result.append(res1)
        result.append(res2)
    result.sort()
    return result

first_answer = listCheck(main)

final_answer  = []


for item1 in first_answer:    #to remove some incorrect match
    string1 = item1
    double_check = True
    for item2 in main:
        string2 = item2
        if longestSubstringFinder(string1, string2) != string1:
            double_check = False
    if double_check:
        final_answer.append(string1)

print(set(final_answer))

main = 'ABACDAQ BACDAQA ACDAQAW XYZCDAQ' #>>> {'CDAQ'}
main = 'homehist histhome' #>>> {'hist', 'home'}

rahimz
  • 23
  • 1
  • 7
1
def LongestSubString(s1,s2):
    if len(s1)<len(s2) :
        s1,s2 = s2,s1  
    
    maxsub =''
    for i in range(len(s2)):
        for j in range(len(s2),i,-1):
            if s2[i:j] in s1 and j-i>len(maxsub):                
                return  s2[i:j]
JiPiBi
  • 11
  • 1
  • 1
    I recommend adding a `return ''` at the end, since the in degenerate case, you do not want to return `None` (as python does by default); you instead want to return the empty string. – Max Bileschi Dec 10 '21 at 15:17
  • Where do you update `maxsub` – Coddy Aug 15 '23 at 22:19
0

Returns the first longest common substring:

def compareTwoStrings(string1, string2):
    list1 = list(string1)
    list2 = list(string2)

    match = []
    output = ""
    length = 0

    for i in range(0, len(list1)):

        if list1[i] in list2:
            match.append(list1[i])

            for j in range(i + 1, len(list1)):

                if ''.join(list1[i:j]) in string2:
                    match.append(''.join(list1[i:j]))

                else:
                    continue
        else:
            continue

    for string in match:

        if length < len(list(string)):
            length = len(list(string))
            output = string

        else:
            continue

    return output
modulus
  • 28
  • 5
0

This is the classroom problem called 'Longest sequence finder'. I have given some simple code that worked for me, also my inputs are lists of a sequence which can also be a string:

def longest_substring(list1,list2):
    both=[]
    if len(list1)>len(list2):
        small=list2
        big=list1
    else:
        small=list1
        big=list2
    removes=0
    stop=0
    for i in small:
        for j in big:
            if i!=j:
                removes+=1
                if stop==1:
                    break
            elif i==j:
                both.append(i)
                for q in range(removes+1):
                    big.pop(0)
                stop=1
                break
        removes=0
    return both
0
**Return the comman longest substring** 
def longestSubString(str1, str2):
    longestString = ""
    maxLength = 0
    for i in range(0, len(str1)):
        if str1[i] in str2:
            for j in range(i + 1, len(str1)):
                if str1[i:j] in str2:
                    if(len(str1[i:j]) > maxLength):
                        maxLength = len(str1[i:j])
                        longestString =  str1[i:j]
return longestString
0

As if this question doesn't have enough answers, here's another option:

from collections import defaultdict
def LongestCommonSubstring(string1, string2):
    match = ""
    matches = defaultdict(list)
    str1, str2 = sorted([string1, string2], key=lambda x: len(x))

    for i in range(len(str1)):
        for k in range(i, len(str1)):
            cur = match + str1[k]
            if cur in str2:
                match = cur
            else:
                match = ""
            
            if match:
                matches[len(match)].append(match)
        
    if not matches:
        return ""

    longest_match = max(matches.keys())
        
    return matches[longest_match][0]

Some example cases:

LongestCommonSubstring("whose car?", "this is my car")
> ' car'
LongestCommonSubstring("apple pies", "apple? forget apple pie!")
> 'apple pie'

mr.plow
  • 43
  • 7
-1

This isn't the most efficient way to do it but it's what I could come up with and it works. If anyone can improve it, please do. What it does is it makes a matrix and puts 1 where the characters match. Then it scans the matrix to find the longest diagonal of 1s, keeping track of where it starts and ends. Then it returns the substring of the input string with the start and end positions as arguments.

Note: This only finds one longest common substring. If there's more than one, you could make an array to store the results in and return that Also, it's case sensitive so (Apple pie, apple pie) will return pple pie.

def longestSubstringFinder(str1, str2):
answer = ""

if len(str1) == len(str2):
    if str1==str2:
        return str1
    else:
        longer=str1
        shorter=str2
elif (len(str1) == 0 or len(str2) == 0):
    return ""
elif len(str1)>len(str2):
    longer=str1
    shorter=str2
else:
    longer=str2
    shorter=str1

matrix = numpy.zeros((len(shorter), len(longer)))

for i in range(len(shorter)):
    for j in range(len(longer)):               
        if shorter[i]== longer[j]:
            matrix[i][j]=1

longest=0

start=[-1,-1]
end=[-1,-1]    
for i in range(len(shorter)-1, -1, -1):
    for j in range(len(longer)):
        count=0
        begin = [i,j]
        while matrix[i][j]==1:

            finish=[i,j]
            count=count+1 
            if j==len(longer)-1 or i==len(shorter)-1:
                break
            else:
                j=j+1
                i=i+1

        i = i-count
        if count>longest:
            longest=count
            start=begin
            end=finish
            break

answer=shorter[int(start[0]): int(end[0])+1]
return answer
Rali Tsanova
  • 67
  • 1
  • 5
-1

First a helper function adapted from the itertools pairwise recipe to produce substrings.

import itertools
def n_wise(iterable, n = 2):
    '''n = 2 -> (s0,s1), (s1,s2), (s2, s3), ...

    n = 3 -> (s0,s1, s2), (s1,s2, s3), (s2, s3, s4), ...'''
    a = itertools.tee(iterable, n)
    for x, thing in enumerate(a[1:]):
        for _ in range(x+1):
            next(thing, None)
    return zip(*a)

Then a function the iterates over substrings, longest first, and tests for membership. (efficiency not considered)

def foo(s1, s2):
    '''Finds the longest matching substring
    '''
    # the longest matching substring can only be as long as the shortest string
    #which string is shortest?
    shortest, longest = sorted([s1, s2], key = len)
    #iterate over substrings, longest substrings first
    for n in range(len(shortest)+1, 2, -1):
        for sub in n_wise(shortest, n):
            sub = ''.join(sub)
            if sub in longest:
                #return the first one found, it should be the longest
                return sub

s = "fdomainster"
t = "exdomainid"
print(foo(s,t))

>>> 
domain
>>> 
wwii
  • 23,232
  • 7
  • 37
  • 77
-1
def LongestSubString(s1,s2):
    left = 0
    right =len(s2)
    while(left<right):
        if(s2[left] not in s1):
            left = left+1
        else:
            if(s2[left:right] not in s1):
                right = right - 1
            else:
                return(s2[left:right])

s1 = "pineapple"
s2 = "applc"
print(LongestSubString(s1,s2))
GoTop
  • 850
  • 1
  • 9
  • 22
user3838498
  • 29
  • 1
  • 1
  • 5