
I have a large string and a large number of smaller substrings, and I am trying to check whether each substring exists in the larger string and to get the position of each one.

string="some large text here"
sub_strings=["some", "text"]

for each_sub_string in sub_strings:
    if each_sub_string in string:
        print each_sub_string, string.index(each_sub_string)

The problem is that, since I have a large number of substrings (around a million), it takes about an hour of processing time. Is there any way to reduce this time, perhaps by using regular expressions or some other approach?

Amith
  • how about using several threads? – Marged Jun 26 '15 at 20:25
  • You are doing a lot of extra work though, because while searching for one substring, you might potentially find another. – xrisk Jun 26 '15 at 20:31
  • @Marged Actually, I have a large number of strings as well and I am using python's multiprocessing module to spawn a separate process for each of them. I have not considered running multiple threads for the sub-strings though. – Amith Jun 26 '15 at 20:33
  • @RishavKundu That's true. Which is why I was thinking of using regular expressions and combining all the substrings together. Is there any way of combining them together for search using normal string processing? – Amith Jun 26 '15 at 20:36
  • @Amith this might be of interest to you https://en.wikipedia.org/wiki/Rabin–Karp_algorithm#Multiple_pattern_search – xrisk Jun 26 '15 at 20:38
  • http://stackoverflow.com/questions/1318126/using-rabin-karp-to-search-for-multiple-patterns-in-a-string – xrisk Jun 26 '15 at 20:40
  • @RishavKundu Thanks. It might be what I am looking for. – Amith Jun 26 '15 at 20:57

3 Answers


The best way to solve this is with a tree implementation. As Rishav mentioned, you're repeating a lot of work here. Ideally, this should be implemented as a tree-based FSM. Imagine the following example:

Large String: 'The cat sat on the mat, it was great'
Small Strings: ['cat', 'sat', 'ca']

Then imagine a tree where each level is an additional letter.

small_lookup = {
    'c': ['a', {          # 'c' + 'a' completes 'ca' ...
        'a': ['t']        # ... and 'c' + 'a' + 't' completes 'cat'
    }],
    's': ['at']           # 's' + 'at' completes 'sat'
}

It's helpful to map this directly onto a Python data structure: you build a tree where the top-level entries are the starting letters, and each maps to the list of potential final substrings that could be completed from it. If you hit a list element with nothing nested beneath it, you've hit a leaf, and you know you've found the first instance of that substring.

Holding that tree in memory is a little hefty, but if you've only got a million strings this should be the most efficient implementation. You should also make sure that you trim the tree as you find the first instance of each word.

For those of you with CS chops, or if you want to learn more about this approach, it's a simplified version of the Aho-Corasick string matching algorithm.
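
A rough illustration of the idea described above (a hand-rolled, simplified sketch rather than full Aho-Corasick, so it still re-walks the tree from every starting position; the nested-dict layout and the '$' end marker are just illustrative choices):

def build_trie(sub_strings):
    # nested dicts, one level per character; '$' marks a complete substring
    root = {}
    for s in sub_strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node['$'] = s
    return root

def first_occurrences(text, sub_strings):
    trie = build_trie(sub_strings)
    found = {}
    for i in range(len(text)):        # try to start a match at every position
        node = trie
        for ch in text[i:]:
            if ch not in node:
                break
            node = node[ch]
            word = node.get('$')
            if word is not None and word not in found:
                found[word] = i       # record the first instance only
    return found

print(first_occurrences('The cat sat on the mat, it was great', ['cat', 'sat', 'ca']))
# finds 'ca' and 'cat' at 4, 'sat' at 8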

If you're interested in learning more about these approaches, there are three main algorithms used in practice: Aho-Corasick, Commentz-Walter, and Rabin-Karp.

There are domains in which each of these algorithms will outperform the others, but given that you have a very large number of sub-strings to search for, with likely a lot of overlap between them, I would bet that Aho-Corasick will give you significantly better performance than the other two methods, as it avoids the O(mn) worst case.

There is also a great Python library that implements the Aho-Corasick algorithm found here that should allow you to avoid writing the gross implementation details yourself.
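
For instance, a minimal sketch using the pyahocorasick package from PyPI (an assumption; it is not necessarily the library linked above, and it is imported under the module name ahocorasick):

import ahocorasick

string = "some large text here"
sub_strings = ["some", "text"]

automaton = ahocorasick.Automaton()
for s in sub_strings:
    automaton.add_word(s, s)      # payload: the substring itself
automaton.make_automaton()        # build the failure links

# one pass over the large string reports every occurrence of every substring
for end_index, word in automaton.iter(string):
    print("%s %d" % (word, end_index - len(word) + 1))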

Slater Victoroff
  • Thanks. Looks a bit tricky to implement though. But it looks like there is a library that can do this. – Amith Jun 26 '15 at 21:06
  • @Amith my thoughts exactly, the link to the library at the end should make the process of implementation pretty trivial. – Slater Victoroff Jun 26 '15 at 21:07

Depending on the distribution of the lengths of your substrings, you might be able to shave off a lot of time using preprocessing.

Say the lengths of your substrings form the set {23, 33, 45} (meaning that you might have millions of substrings, but each one has one of these three lengths).

Then, for each of these lengths, compute the Rabin (rolling-hash) window over your large string, and place the results into a dictionary for that length. Take 23, for example: go over the large string and compute the 23-character window hashes. Say the hash for position 0 is 13; insert into the dictionary rabin23 that 13 maps to [0]. Then you see that the hash for position 1 is 13 as well, so update rabin23 so that 13 maps to [0, 1]. At position 2 the hash is 4, so in rabin23, 4 maps to [2].

Now, given a substring, you can calculate its Rabin hash and immediately check the relevant dictionary for the indices of its occurrence (which you then need to compare).


BTW, in many cases, the lengths of your substrings will exhibit a Pareto behavior, where say 90% of the strings are in 10% of the lengths. If so, you can do this for these lengths only.
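
A minimal sketch of this preprocessing, assuming a simple polynomial rolling hash (the BASE and MOD constants and the function names are illustrative choices; candidate positions are re-checked against the text to weed out hash collisions):

from collections import defaultdict

BASE, MOD = 256, (1 << 61) - 1   # illustrative rolling-hash parameters (an assumption)

def poly_hash(s):
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def window_hashes(text, length):
    # map the hash of every `length`-wide window of text to the list of
    # starting positions that produce it (the "rabin23"-style dictionary)
    table = defaultdict(list)
    if length == 0 or length > len(text):
        return table
    top = pow(BASE, length - 1, MOD)          # weight of the character leaving the window
    h = poly_hash(text[:length])
    table[h].append(0)
    for i in range(length, len(text)):
        h = ((h - ord(text[i - length]) * top) * BASE + ord(text[i])) % MOD
        table[h].append(i - length + 1)
    return table

def find_all(big_string, sub_strings):
    by_length = defaultdict(list)
    for s in sub_strings:
        by_length[len(s)].append(s)
    hits = {}
    for length, group in by_length.items():   # one pass over big_string per distinct length
        table = window_hashes(big_string, length)
        for s in group:
            # verify candidates, since different windows can collide on the same hash
            positions = [p for p in table.get(poly_hash(s), [])
                         if big_string[p:p + length] == s]
            if positions:
                hits[s] = positions
    return hits

print(find_all("some large text here", ["some", "text", "large"]))
# {'some': [0], 'text': [11], 'large': [5]}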

Ami Tavory
  • Thanks. Sounds promising. Luckily I am working with Chinese characters (names of people) and they tend to be around 3 or 4 characters generally. I will find out more about this. – Amith Jun 26 '15 at 20:56

This approach is sub-optimal compared to the other answers, but it might be good enough regardless, and it is simple to implement. The idea is to turn the algorithm around: instead of testing each sub-string in turn against the larger string, iterate over the large string and test against possible matching sub-strings at each position, using a dictionary to narrow down the number of sub-strings you need to test.

The output will differ from the original code in that it will be sorted in ascending order of index as opposed to by sub-string, but you can post-process the output to sort by sub-string if you want to.

Create a dictionary mapping each possible 1-3 character prefix to the list of sub-strings that begin with it. Then iterate over the string and, at each character, read the 1-3 characters after it and check for a match at that position against each sub-string in the dictionary that begins with those characters:

string="some large text here"
sub_strings=["some", "text"]

# add each of the substrings to a dictionary based on the first 1-3 characters
dict = {}
for s in sub_strings:
    if s[0:3] in dict:
        dict[s[0:3]].append(s)
    else:
        dict[s[0:3]] = [s]

# iterate over the chars in string, testing words that match on the first 1-3 chars
for i in range(0, len(string)):
    for j in range(1, 4):
        char = string[i:i+j]
        if char in dict:
            for word in dict[char]:
                if string[i:i+len(word)] == word:
                    print word, i

If you don't need to match any sub-strings 1 or 2 characters long, then you can get rid of the for j loop and just assign char with char = string[i:i+3]

Using this approach I timed the algorithm by reading in Tolstoy's War and Peace and splitting it into unique words, like this:

with open ("warandpeace.txt", "r") as textfile:
    string=textfile.read().replace('\n', '')    
sub_strings=list(set(string.split()))

Doing a complete search for every unique word in the text and outputting every instance of each took 124 seconds.

samgak