Fast String within List Searching

Question

Using Python 3, I have a list containing over 100,000 strings (list1), each 300 characters long at most. I also have a list of over 9 million substrings (list2)--I want to count how many elements a substring in list2 appears in. For instance,

list1 = ['cat', 'caa', 'doa', 'oat']
list2 = ['at', 'ca', 'do']

I want the function to return (mapped to list2):

[2, 2, 1]

Normally, this is very simple and requires very little. However, due to the massive size of the lists, I have efficiency issues. I want to find the fastest way to return that counter list.

I've tried list comprehensions, generators, maps, loops of all kinds and I have yet to find an fast way to do this easy task. What is theoretically the fastest way to complete this objective, preferably taking O(len(list2)) steps very quickly?

score 2 · Answer 1 · answered Dec 18 '11 at 04:39

2

set M = len(list1) and N = len(list2)

For each of the N entries in list2 you're going to have to do M comparisons against the entries in list1. That's a worst case running time of O(M x N). If you take it further, lets take each entry in list2 to be of length 1 and each entry in list1 to be of length 300, then you got a running time of O(300M x N).

If performance is really an issue, try dynamic programming. Here is a start:

1) sort list2 in ascending order of length like so:

['scorch', 'scorching', 'dump', 'dumpster', 'dumpsters']

2) sort them into sublists such that each preceeding entry is a subset of the proceeding entry like so:

[['scorch', 'scorching'] , ['dump', 'dumpster', 'dumpsters']]

3) Now if you compare against list1 and 'scorch' is not in there, then you don't have to search for 'scorching' either. Likewise if 'dump' is not in there, neither is 'dumpster' or 'dumpsters'

note the worst case run time is still the same

answered Dec 18 '11 at 04:39

puk

16,318
29
119
199

That's a good idea, but each substring in list2 is in at least one element of list1. – user1104160 Dec 18 '11 at 04:43
1

This will require a great deal of overhead, but you could try indexing both `list1` and `list2` based on the characters they have, so if an entry of `list1` is `'abcd'` you wouldn't check the `list2` entry `'efg'`, only the `list2` entries which fall under the path/branch of `'a'`, `'b'`, `'c'` or `'d'` – puk Dec 18 '11 at 04:47
The same amount of steps would be taken though, right? Right now, for each substring in list2 I count by `sum(1 for string in list1 if substring in string)`. Wouldn't the process of checking for characters not included take the same amount of time as the if/in statement? – user1104160 Dec 18 '11 at 04:51
@user1104160 I could be mistaken, but I don't think you can get around the worst case scenario of `O(300MxN)`. If this is something that gets called quite often, I recommend taking the time to index, in a massive tree/array, based on length and/or letters – puk Dec 18 '11 at 04:55
Shoot. I was hoping that there would be some magic built-in function related to "count" that I could call that would be much faster than what I am doing now. The processes is done once, but it would take a very, very long time to run and it needs to done for like 10 different sets. – user1104160 Dec 18 '11 at 05:00
1

Hold on, I am trying to create a small example for you...man what a lousy way to spend one's Friday – puk Dec 18 '11 at 05:07
@user1104160 Wow, what exactly are you doing? I am trying to create 9 million randomly generated strings and it has taken over 5 minutes now – puk Dec 18 '11 at 05:16
Genomics. 9 million is on the small side. – user1104160 Dec 18 '11 at 05:33
Oddly enough getting the 9 million sequences was pretty fast, taking around 3 minutes. – user1104160 Dec 18 '11 at 05:39
I will work on the script I promised and post it later (probably tomorrow). – puk Dec 18 '11 at 06:06
I figured out a very fast way--using dictionaries, I can preallocate everything and go in the inverse direction--go through each substring in list1 and if they are present in the dictionary where list2 strings are the keys, then I can add one to that value. Thanks for your help, though! – user1104160 Dec 18 '11 at 17:56

score 2 · Answer 2 · edited May 23 '17 at 10:27

I believe this task can be solved in linear time with an Aho Corasick string matching machine. See this answer for additional information (maybe you get ideas from the other answers to that question, too - it is almost the same task and I think Aho Corasick is the theoretically fastest way to solve this).

You will have to modify the string matching machine in such way, that instead of returning the match it increases the counter of every matched substring by one. (This should be only a minor modification).

score 0 · Answer 3 · answered Dec 18 '11 at 15:57

Not sure how you could avoid having some sort of O(n**2) algorithm. Here is a simple implementation.

>>> def some_sort_of_count(list1, list2):
>>>     return [sum(x in y for y in list1) for x in list2]
>>> 
>>> list1 = ['cat', 'caa', 'doa', 'oat']
>>> list2 = ['at', 'ca', 'do']
>>> some_sort_of_count(list1, list2)
[2, 2, 1]

Fast String within List Searching

3 Answers3