python trie implementation with substring matching

Question

My problem: Let say I have strings:

ali, aligator, aliance

because they have common prefix I want to store them in trie, like:

trie['ali'] = None
trie['aligator'] = None
trie['aliance'] = None

So far so good - i can use trie implementation from Biopython library. But what I want to achive is abilitiy to find all keys in that trie that contain particular substring.

For example:

trie['ga'] would return 'aligator' and 
trie['li'] would return ('ali','aligator','aliance').

Any suggestions?

Do you want to find this once (so it doesn't matter if this is slow), or repeatedly (so it does matter)? — Phil H, Nov 19 '12 at 14:00
@PhilH The OP is asking about a specific implementation (trie) which implements this efficiently. — Konrad Rudolph, Nov 19 '12 at 14:01
repeatedly - I want to find all occurances and it must be fast. — mnowotka, Nov 19 '12 at 14:03
@JonClements - no - look at the interface. I need to search for substring not only for prefix. — mnowotka, Nov 19 '12 at 14:06
@KonradRudolph: A trie efficiently stores and retrieves strings with common stubs. It does not collate common subsequences, as far as I know. So to find common subsequences between different parts of the key tree is not efficient unless we also construct an index. But there is no point in the storage and insertion/deletion costs if the common subsequence search is rare. If it is common, then the benefits outweigh the costs and we should construct the index. Have I missed something there? — Phil H, Nov 19 '12 at 14:15
@mnowotka: Are there significant storage constraints? I'm guessing the set of keys is not insignficant if it's worth using a trie. — Phil H, Nov 19 '12 at 14:16
OK. My case is that I have 1.2 milion of strings like this: InChI=1S/C9H12/c1-4-7-8(5-2)9(7)6-3/h4-6H,1-3H3/b7-4-,8-5-,9-6- And I wan to make index to be able to find all strings that contain some given substring. The implementation doesn't have to use trie but it should be effective. — mnowotka, Nov 19 '12 at 14:19
You may need a different data structure. Something like a [String B-Tree](http://callisto.nsu.ru/documentation/CSIR/Algo/sbtree/StringBTree.pdf), which is essentially a combination of B-trees and Patricia tries. — Paolo Moretti, Nov 19 '12 at 14:29
@mnowotka you've got enough to be getting on with that then I think — Jon Clements, Nov 19 '12 at 14:36
@Phil The only definition of trie that I know is a prefix tree, and it *does* collate common prefixes – by definition. And furthermore, tries usually store not individual strings but enumerate all suffixes of all strings. As a consequence, they *do* collate all common substrings (because every substring is a prefix of a suffix). — Konrad Rudolph, Nov 19 '12 at 14:56
@KonradRudolph: I see what you mean. The problem is that to search for a common substring, the algorithm will have to descend every part of the tree to find the substring; there is no guarantee except length limits that a substring won't be further down the tree. So although you have eliminated comparisons to duplicate prefixes, it is now O(n) in the number of prefixes (prefix is the term I was trying to remember when I said 'stub' earlier). I don't think the question is necessarily about tries, as is apparent by the substring request. — Phil H, Nov 19 '12 at 16:03
@Phil No, if you have a prefix tree of suffixes then every possible substring can be reached directly from the root, and the search time is Θ(length of substring). — Konrad Rudolph, Nov 19 '12 at 16:05
@KonradRudolph: Could you elucidate that in an answer? It would be instructive both to me and anyone reading this in the future. — Phil H, Nov 19 '12 at 16:08
If your dataset doesn't change frequently, an O(n log n) suffix array implementation would be rather trivial. — Juan Lopes, Nov 19 '12 at 16:09
@PhilH After being immensely busy the last few weeks I came back to this question out of interest. Turns out, my memory confused me – unlike a suffix tree, which indexes all infixes of a string efficiently, a trie doesn’t actually do that. Somehow I had thought they were largely symmetrical but that’s nonsense. So everything you said up to this point, including your answer, is correct. — Konrad Rudolph, Nov 27 '12 at 14:35

score 1 · Answer 1 · edited May 23 '17 at 12:01

1

Edit: I think you may be looking for a Suffix tree, particularly noting that "Suffix trees also provided one of the first linear-time solutions for the longest common substring problem.".

Just noticed another SO question that seems very related: Finding longest common substring using Trie

edited May 23 '17 at 12:01

Community

1
1

answered Nov 19 '12 at 14:36

Phil H

19,928
7
68
105

That wasn't me ;) Here the problem is I don't know in advance how to split the string. So I would probabely need all possible substrings of all strings. This is called suffix array but I don't see any good python implementation. – mnowotka Nov 19 '12 at 14:45
So far, the library that meets my requirements closest is: https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/ – mnowotka Nov 19 '12 at 15:00
Have you looked at the links in the answers here? http://stackoverflow.com/questions/9347078/python-library-for-generalized-suffix-trees – Phil H Nov 19 '12 at 15:58
I've removed the original chunk of my answer, since suffix trees are a better solution. – Phil H Nov 19 '12 at 16:05

score -3 · Answer 2 · answered Nov 19 '12 at 15:10

I would do something like this:

class Trie(object):
    def __init__(self,strings=None):
        #set the inicial strings, if passed
        if strings:
            self.strings = strings
        else:
            self.strings =[]

    def __getitem__(self, item):
        #search for the partial string on the string list
        for s in self.strings:
            if item in s:
                yield s

    def __len__(self):
        #just for fun
        return len(self.strings)

    def append(self,*args):
        #append args to existing strings
        for item in args:
            if item not in self.strings:
                self.strings.append(item)

Then:

t1 = Trie()
t1.append("ali","aligator","aliance")
print list(t1['ga'])
print list(t1['li'])
>>['aligator']
>>['ali', 'aligator', 'aliance']

This appears to be just searching all the supplied strings for the target fragment, on every target fragment. That is O(nmk) in the number of strings to search n, the number of fragments m, and the average string length k. The point of the question is to find efficient ways to search, which I'm afraid this is not. — Phil H, Nov 19 '12 at 15:57
However, correct keyword. Beautiful implementation of trie can be found here: https://github.com/fnl/patricia-trie/blob/master/patricia.py#L77 — Nikolay Fominyh, Jun 12 '17 at 21:13

python trie implementation with substring matching

2 Answers2