According to https://en.wikipedia.org/wiki/Longest_common_substring#Suffix_tree, you can use a suffix tree to solve the problem.
The idea is as follows:
- Choose one suffix from each string.
- Find the longest common prefix of these suffixes.
- If the prefix is longer than original records, replace them. Or:
- If the prefix has the same length as original records, append it into the records.
If we can try every combination of suffixes, the substring(s) will be the longest. However, implementing this algorithm using loops has a poor performance. That's why suffix tree (a trie storing suffixes) is important. If you are unfamiliar with trie, here is its wiki: https://en.wikipedia.org/wiki/Trie. In short, trie is famous for proceeding prefixes, and by inserting suffixes, the prefixes becomes general substrings.
Suppose original string list is called list
. When we insert all suffixes of list[i]
, we attach information i
into visited nodes. When a node has all i
from 0
to len(list) - 1
, we can say the node is full. Of course, the information can be implemented with sets. Now the task becomes finding the longest full sequence(s) of the suffix tree.
Back to your problem: finding a common identifier for journals using dois. A doi is not that long, so generating substrings can be accomplished in expected time (although it's relatively long). And the following example doesn't consider Unicode characters, but I doubt they'll appear in dois.
Python code (I'm not an expert in Python, but you can regroup these functions and statements into a class for your usages):
from dataclasses import dataclass, field
@dataclass
class Node:
char: str
layer: int
parent: "Node"
# sources: set = field(default_factory=set)
# By using an integer counter we can save ~40% time
sources: int = 0
children: dict = field(default_factory=dict)
dois = [
'10.1001/jamacardio.2016.5501', '10.1001/jamacardio.2017.3145',
'10.1001/jamacardio.2018.3029', '10.1001/jamacardio.2020.5573',
'10.1001/jamacardio.2020.0647'
]
# Sort the input so the first doi has fewer substrings
dois.sort(key=len)
def full(node):
global dois
# return len(node.sources) == len(dois)
return node.sources == len(dois)
def find_nodes(root):
"""
Find ending nodes of full-node sequences.
Since nodes have the property `parent`, it would be easy to trace back,
and saving nodes is more efficient than building strings.
"""
results = []
maxh = 0
def dfs(node):
nonlocal results, maxh
'''
We can expect the full sequences to start from the top of the suffix
tree. If not, since an abnormal sequence is a suffix of other suffixes,
it conflicts with the fact that all possible suffixes have been inserted.
'''
if not full(node) and node.layer > 0:
return
if node.layer > maxh:
maxh = node.layer
results = []
if node.layer == maxh and maxh > 0:
results.append(node)
for next in node.children.values():
dfs(next)
dfs(root)
return results
def build_string(node):
'''
Get expected strings from ending nodes.
'''
s = ''
cur = node
while cur != None and full(cur):
s += cur.char
cur = cur.parent
# Reverse s. Weird that `str(reversed(s))` doesn't work
return ''.join(reversed(s))
def insert(root, s, source):
cur = root
for i in range(len(s)):
ch = s[i]
if ch not in cur.children:
cur.children[ch] = Node(ch, i + 1, cur)
cur = cur.children[ch]
# cur.sources.add(source)
if cur.sources == source:
cur.sources += 1
# All following nodes won't be full.
# This early return saves another 55% time.
elif cur.sources < source:
return
root = Node(0, 0, None)
# Insert all suffixes of dois into the tree.
for i in range(len(dois)):
doi = dois[i]
for j in range(len(doi)):
insert(root, doi[j:], i)
results = find_nodes(root)
# Transform nodes to strings.
for i in range(len(results)):
results[i] = build_string(results[i])
print(results)
There is a pruning step in insert()
, which essentially converts inserting all suffixes of dois into inserting only the suffixes of the first doi. So the size of the suffix tree doesn't grow in proportion to
len(dois)
.
To visualize the suffix tree, you can use Graphviz:
def toGraphviz(root):
s = 'digraph G {\n overlap=scale\n node [style=filled shape=circle]\n'
stack = []
stack.append(root)
while len(stack) > 0:
node: Node = stack.pop(len(stack) - 1)
s += f' {id(node)} [label=""]'
for k in node.children:
v = node.children[k]
s += f' "{id(node)}" -> "{id(v)}" [label="{k}" {"color=red penwidth=2.0" if full(v) else ""}]\n'
stack.append(v)
s += '}'
return s
# Graphviz should be installed first: https://graphviz.org/download/
# Use `dot tree.dot -Tsvg -o tree.svg` to render a svg file.
with open('tree.dot', 'w') as f:
f.write(toGraphviz(root))
Note that visualization only works when list
(dois
in the code) is small and strings are not too long. dot
can freeze if the input is too large. And even if it completes, you cannot really analyze such a large graph easily. When I changed dois
to ["ABAB", "BABA", "ABBA"]
(and without the pruing step), the visualized suffix tree was like:
![suffix tree of ["ABAB", "BABA", "ABBA"]](../../images/3815836781.webp)
All full sequences are marked as red, and ["AB", "BA"]
is the result. Your original example can be visualized as well, but the image is way too large to display here.
BTW, there is another question similar to this one, and the answers there mostly involves dynamic programming: Longest common substring from more than two strings. I'd admit DP is more precise.
Edit: The DP solution in the link runs much faster than this suffix tree solution. I think that's the real answer.