How to find the longest common substring in a list of strings (>2 strings)? Trying FuzzyWuzzy and Sequence matcher

Question

So I am trying to find a common identifier for journals using dois. For example, I have a list of dois for a journal: ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']

(The list is much longer than this)

I want to find the longest common substring in my list. I have tried SequenceMatcher but can only look for similarity between 2 strings.

 journal_list
 
 def longestSubstring(str1,str2):
  
     #initialize SequenceMatcher object with 
     #input string
     seqMatch = SequenceMatcher(None,str1,str2)
  
     #find match of longest sub-string
     #output will be like Match(a=0, b=0, size=5)
     match = seqMatch.find_longest_match(0, len(str1), 0, len(str2))
  
     if (match.size!=0):
         print (str1[match.a: match.a + match.size]) 
     else:
         print ('No longest common sub-string found')
 
     for journal in journal_list:
         str1 = journal_list[1]
 
 print(longestSubstring(str1,journal))

Expected output:

'10.1001/jamacardio.20'

Are you actually looking for substrings in general, ot just the initial part of the doi specifier? — molbdnilo, Jan 19 '23 at 14:45
ideally substrings in general since there are rare cases where the initial part of the doi might different within the same journal. But for now, there is a simple solution there, that would work @molbdnilo — msci, Jan 19 '23 at 14:47
So you actually want '10.1001/jamacardio.2017.3145' and '20.2002/iamarobot.1001.1234' to have the longest common substring '.1001'? — molbdnilo, Jan 20 '23 at 07:36
I think a better solution would be to actually parse the doi identifier and then compare their constituent parts. — molbdnilo, Jan 20 '23 at 07:38

molbdnilo · Answer 1 · 2023-01-19T18:49:46.463

0

I think it's overkill to use any fancy matching library for this and would start with a function that works with two strings:

def common_2(s1, s2):
    longest = ""
    for i in range(min(len(s1), len(s2))):
        if s1[i] == s2[i]:
            longest += s1[i]
        else:
            break
    return longest

Then just apply this repeatedly to all the strings:

def common(ss):
    if len(ss) < 1:
        return ""
    if len(ss) == 1:
        return ss[0]
    part = common_2(ss[0], ss[1])
    for i in range(2, len(ss)):
        part = common_2(part, ss[i])
    return part


>>> journals = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']
>>> common(journals)
'10.1001/jamacardio.20'

This only finds the common prefix; if you want general substrings, just modify common_2.

edited Jan 19 '23 at 18:49

answered Jan 19 '23 at 14:20

molbdnilo

64,751
3
43
82

1

This does not search for the longest common substring, but for the longest common prefix. – maxbachmann Jan 19 '23 at 14:42
If I understand the question correctly, that;'s what the actual problem is, mistakenly generalized to substrings. – molbdnilo Jan 19 '23 at 14:43
I tried running this code but I'm getting "name 'initial' is not defined" – msci Jan 19 '23 at 16:16
The longest common substring of "abcdxy", "abcexy" and "axy" is "xy". You'd find "a". – Kelly Bundy Jan 20 '23 at 05:16
And for *prefix*, you could just use `os.path.commonprefix`. – Kelly Bundy Jan 20 '23 at 05:29
@KellyBundy These are not paths. – molbdnilo Jan 20 '23 at 07:39
@molbdnilo `os.path.commonprefix` is an evil function which doesn't necessarily return a valid path: https://docs.python.org/3/library/os.path.html#os.path.commonprefix. It just treats paths as strings and finds their common prefix. – Simon Smith Jan 20 '23 at 09:26
I know, it might not be the intended use, but it works. What about the other issue I pointed out, that your substring version is wrong? You still left that in your answer. – Kelly Bundy Jan 20 '23 at 16:15
@KellyBundy Try reading the last sentence more carefully. (And "it works" is not a good reason.) – molbdnilo Jan 21 '23 at 13:11
How am I supposed to read it? It's wrong, and I've already shown it with an example input. Or how exactly would you modify `common_2` so that you'd get "xy" instead of "a"? – Kelly Bundy Jan 21 '23 at 16:47
Still wondering ... – Kelly Bundy Jan 23 '23 at 19:29
Do you just not care that it's wrong? – Kelly Bundy Jan 25 '23 at 02:28

Simon Smith · Answer 2 · 2023-01-22T14:19:26.607

According to https://en.wikipedia.org/wiki/Longest_common_substring#Suffix_tree, you can use a suffix tree to solve the problem.

The idea is as follows:

Choose one suffix from each string.
Find the longest common prefix of these suffixes.
If the prefix is longer than original records, replace them. Or:
If the prefix has the same length as original records, append it into the records.

If we can try every combination of suffixes, the substring(s) will be the longest. However, implementing this algorithm using loops has a poor performance. That's why suffix tree (a trie storing suffixes) is important. If you are unfamiliar with trie, here is its wiki: https://en.wikipedia.org/wiki/Trie. In short, trie is famous for proceeding prefixes, and by inserting suffixes, the prefixes becomes general substrings.

Suppose original string list is called list. When we insert all suffixes of list[i], we attach information i into visited nodes. When a node has all i from 0 to len(list) - 1, we can say the node is full. Of course, the information can be implemented with sets. Now the task becomes finding the longest full sequence(s) of the suffix tree.

Back to your problem: finding a common identifier for journals using dois. A doi is not that long, so generating substrings can be accomplished in expected time (although it's relatively long). And the following example doesn't consider Unicode characters, but I doubt they'll appear in dois.

Python code (I'm not an expert in Python, but you can regroup these functions and statements into a class for your usages):

from dataclasses import dataclass, field


@dataclass
class Node:
    char: str
    layer: int
    parent: "Node"
    # sources: set = field(default_factory=set)
    # By using an integer counter we can save ~40% time
    sources: int = 0
    children: dict = field(default_factory=dict)


dois = [
    '10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
    '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
    '10.1001/jamacardio.2020.0647'
]

# Sort the input so the first doi has fewer substrings
dois.sort(key=len)

def full(node):
    global dois
    # return len(node.sources) == len(dois)
    return node.sources == len(dois)


def find_nodes(root):
    """
    Find ending nodes of full-node sequences.
    Since nodes have the property `parent`, it would be easy to trace back,
    and saving nodes is more efficient than building strings.
    """
    results = []
    maxh = 0

    def dfs(node):
        nonlocal results, maxh
        '''
        We can expect the full sequences to start from the top of the suffix 
        tree. If not, since an abnormal sequence is a suffix of other suffixes, 
        it conflicts with the fact that all possible suffixes have been inserted.
        '''
        if not full(node) and node.layer > 0:
            return
        if node.layer > maxh:
            maxh = node.layer
            results = []
        if node.layer == maxh and maxh > 0:
            results.append(node)

        for next in node.children.values():
            dfs(next)

    dfs(root)
    return results


def build_string(node):
    '''
    Get expected strings from ending nodes.
    '''
    s = ''
    cur = node
    while cur != None and full(cur):
        s += cur.char
        cur = cur.parent
    # Reverse s. Weird that `str(reversed(s))` doesn't work
    return ''.join(reversed(s))


def insert(root, s, source):
    cur = root
    for i in range(len(s)):
        ch = s[i]
        if ch not in cur.children:
            cur.children[ch] = Node(ch, i + 1, cur)
        cur = cur.children[ch]
        # cur.sources.add(source)
        if cur.sources == source:
            cur.sources += 1
        # All following nodes won't be full.
        # This early return saves another 55% time.
        elif cur.sources < source:
            return


root = Node(0, 0, None)

# Insert all suffixes of dois into the tree.
for i in range(len(dois)):
    doi = dois[i]
    for j in range(len(doi)):
        insert(root, doi[j:], i)
        
results = find_nodes(root)

# Transform nodes to strings.
for i in range(len(results)):
    results[i] = build_string(results[i])

print(results)

There is a pruning step in insert(), which essentially converts inserting all suffixes of dois into inserting only the suffixes of the first doi. So the size of the suffix tree doesn't grow in proportion to len(dois).

To visualize the suffix tree, you can use Graphviz:

def toGraphviz(root):
    s = 'digraph G {\n  overlap=scale\n  node [style=filled shape=circle]\n'
    stack = []
    stack.append(root)
    while len(stack) > 0:
        node: Node = stack.pop(len(stack) - 1)
        s += f'  {id(node)} [label=""]'
        for k in node.children:
            v = node.children[k]
            s += f'  "{id(node)}" -> "{id(v)}" [label="{k}" {"color=red penwidth=2.0" if full(v) else ""}]\n'
            stack.append(v)
    s += '}'
    return s

# Graphviz should be installed first: https://graphviz.org/download/
# Use `dot tree.dot -Tsvg -o tree.svg` to render a svg file.
with open('tree.dot', 'w') as f:
    f.write(toGraphviz(root))

Note that visualization only works when list (dois in the code) is small and strings are not too long. dot can freeze if the input is too large. And even if it completes, you cannot really analyze such a large graph easily. When I changed dois to ["ABAB", "BABA", "ABBA"] (and without the pruing step), the visualized suffix tree was like:

suffix tree of ["ABAB", "BABA", "ABBA"]

All full sequences are marked as red, and ["AB", "BA"] is the result. Your original example can be visualized as well, but the image is way too large to display here.

BTW, there is another question similar to this one, and the answers there mostly involves dynamic programming: Longest common substring from more than two strings. I'd admit DP is more precise.

Edit: The DP solution in the link runs much faster than this suffix tree solution. I think that's the real answer.

score 0 · Answer 3 · answered Jan 21 '23 at 17:07

0

Simple brute force one, with your list multiplied by 1000 (so it's 5000 strings) it still takes less than a second:

ss = ['10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145', '10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573', '10.1001/jamacardio.2020.0647']

def substrings(s):
    return {s[i:j]
            for j in range(len(s)+1)
            for i in range(j+1)}

common = set.intersection(*map(substrings, ss))

print(max(common, key=len))

answered Jan 21 '23 at 17:07

Kelly Bundy

23,480
7
29
65

What's the issue? It's correct and simple, unlike the other answers. Maybe not the most efficient, but quite possibly efficient enough, as mentioned. – Kelly Bundy Jan 22 '23 at 06:50

How to find the longest common substring in a list of strings (>2 strings)? Trying FuzzyWuzzy and Sequence matcher

3 Answers3