
I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.

I am writing an implementation of the game Ghost for MIT's Introduction to Programming, assignment 5.

As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").

Update: I could use something that iterated through the list each time, such as Peter's simple suggestion:

def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)

I previously had:

wordlist = [w for w in wordlist if w.startswith(word_fragment)]

(from here) to narrow the list down to the list of valid words that start with that fragment and consider it a loss if wordlist is empty. The reason that I took this approach was that I (incorrectly, see below) thought that this would save time, as subsequent lookups would only have to search a smaller list.

It occurred to me that this is going through each item in the original wordlist (38,000-odd words) checking the start of each. This seems silly when wordlist is ordered, and the comprehension could stop once it hits something that is after the word fragment. I tried this:

def words_starting_with(wordlist, word_fragment):
    # (Hypothetical wrapper name, just so the snippet stands alone.)
    newlist = []
    for w in wordlist:
        if w[:len(word_fragment)] > word_fragment:
            # Take advantage of the fact that the list is sorted
            break
        if w.startswith(word_fragment):
            newlist.append(w)
    return newlist

but that ran at about the same speed, which I thought may be because list comprehensions run as compiled code?

I then thought that more efficient again would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?

Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.

UPDATE:

I have since tested the below suggestions with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:

The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows:
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect_left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452

The overhead of creating a second list etc. clearly took more time than searching the longer list. In hindsight, the bisect approach was likely the best one regardless, since the "reducing list" approach made the first run slower, and the first run was the worst-case scenario.
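
For reference, a minimal sketch of the kind of test script I mean (the file name, helper names, and timing details here are illustrative, not my exact script):

# A rough sketch of the timing comparison, assuming a words.txt file
# of whitespace-separated words. Helper names are made up for illustration.
from time import clock
from bisect import bisect_left

with open('words.txt') as f:
    wordlist = sorted(f.read().split())

def simple(wordlist, frag):
    return any(w.startswith(frag) for w in wordlist)

def bisected(wordlist, frag):
    i = bisect_left(wordlist, frag)
    return i < len(wordlist) and wordlist[i].startswith(frag)

fragments = ['p', 'py', 'pyt', 'pyth', 'pytho']
for name, func in [('simple', simple), ('bisect_left', bisected)]:
    te = clock()
    for frag in fragments:
        func(wordlist, frag)
    print name, 'took', clock() - te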

Thanks all for some excellent suggestions, and well done Peter for the best answer!!!

Hooloovoo
  • How will binary search help here? – Senthil Kumaran Jul 17 '11 at 09:48
  • Thank you for your comments, Senthil. See the answer to your comment below. A linear search of a list runs in linear time, whereas binary search runs in log(n) time. This clearly makes a difference on a long word list such as mine. – Hooloovoo Jul 31 '11 at 05:41

6 Answers


Generator expressions are evaluated lazily, so if you only need to determine whether or not your word is valid, I would expect the following to be more efficient, since it can stop as soon as it finds a match instead of building the full list:

def word_exists(wordlist, word_fragment):
    return any(w.startswith(word_fragment) for w in wordlist)

Note that the lack of square brackets is important for this to work.
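
To illustrate the laziness with a quick standard-library sketch (not part of the solution itself):

from itertools import count

# any() with a generator expression stops at the first True result:
print any(n > 5 for n in count())   # True, after checking n = 0..6

# With square brackets, the list comprehension would be built in full
# before any() ever looks at it; here that would never terminate:
# any([n > 5 for n in count()])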

However this is obviously still linear in the worst case. You're correct that binary search would be more efficient; you can use the built-in bisect module for that. It might look something like this:

from bisect import bisect_left
def word_exists(wordlist, word_fragment):
    try:
        return wordlist[bisect_left(wordlist, word_fragment)].startswith(word_fragment)
    except IndexError:
        return False # word_fragment is greater than all entries in wordlist

bisect_left runs in O(log n), so it is going to be considerably faster for a large wordlist.
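
For example, with a small sorted list (hypothetical data):

wordlist = ['apple', 'apply', 'banana', 'band']   # already sorted
print word_exists(wordlist, 'app')    # True: 'apple' is the leftmost word >= 'app'
print word_exists(wordlist, 'bane')   # False: 'bane' sorts after every entry,
                                      # so we hit the IndexError branch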

Edit: I would guess that the example you gave loses out if your word_fragment is something really common (like 't'), in which case it probably spends most of its time assembling a large list of valid words, and the gain from only having to do a partial scan of the list is negligible. Hard to say for sure, but it's a little academic since binary search is better anyway.

Peter
  • Well, your first example actually is not a list comprehension; it's a generator expression, a different thing. You should be more explicit about that. – Ant Jul 17 '11 at 10:51
  • nice bisect solution and easy to adapt if you actually need to return all words starting with the given prefix. – tomasz Jul 17 '11 at 11:00
  • Also, you could use `any` instead of `True in …` – Neil G Jul 17 '11 at 18:14
  • Thanks, have updated appropriately. I had a feeling there should be a cleaner way than "True in ..." but couldn't express it clearly enough for a search to work. – Peter Jul 17 '11 at 19:56
  • @Peter The function **word_exists()** says whether there is a word beginning with **word_fragment** in **wordlist** or not. How do you use this function to get the same result as Hooloovoo, whose iteration **for w in wordlist:** produces the list **wordlist** of the words beginning with **word_fragment**? – eyquem Jul 17 '11 at 21:16
  • eyquem: You don't :) His description of the problem only says that he needs to know if the list is empty or not, which he is currently doing by building a list and seeing if it is empty. The linked description of the "Ghost" game also doesn't sound like you necessarily need the full list of words. – Peter Jul 18 '11 at 00:23
  • @Peter I don't agree with you. Your solution is very clever, but it isn't what Hooloovoo asks for. In my opinion, he does want to narrow a list named **wordlist** down to a new list of the same name containing only the words beginning with the current string named **word_fragment**. That is what he explains in a similar sentence, what his code is aimed at, and what is performed in the page he links to (under 'here' in his post). – eyquem Jul 18 '11 at 07:43
  • @Peter Yes, he wrote _"I need to determine whether a string of characters is the start of any valid word_ [in] _a list of valid words ("wordlist")"_, but that is to describe the condition **if w.startswith(word_fragment)** in the list comprehension that reduces **wordlist**, this condition being clearly said to be _"part of"_ the task. The question is whether Hooloovoo really wants to follow the same algorithm as in the post he linked to, or whether his coding of the game follows another algorithm in which he only needs a True/False answer, for which your solution is excellent. – eyquem Jul 18 '11 at 07:44
  • Okay. My code answers his question as I read it; maybe it's what he really needs, maybe not. His question is tagged "performance" and I consider that the most important performance enhancement is to figure out exactly what you need to do and what you don't. If he does need the actual words, well, I still think the suggestion of `bisect` was the right one and it is not fantastically hard to adapt it to return them - Neil's answer would appear to do that now, for example. – Peter Jul 19 '11 at 08:04

You're right that you can do this more efficiently given that the list is sorted.

I'm building off of @Peter's answer, which returns a single element. I see that you want all the words that start with a given prefix. Here's how you do that:

from bisect import bisect_left
wordlist[bisect_left(wordlist, word_fragment):
         bisect_left(wordlist, word_fragment[:-1] + chr(ord(word_fragment[-1])+1))]

This returns the slice from your original sorted list.
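
For example, with illustrative data (the upper bound of the slice is the fragment with its final character bumped by one):

from bisect import bisect_left

wordlist = ['abcd', 'abce', 'abcf', 'abd']   # sorted, illustrative
word_fragment = 'abc'
upper = word_fragment[:-1] + chr(ord(word_fragment[-1]) + 1)   # 'abd'
print wordlist[bisect_left(wordlist, word_fragment):
               bisect_left(wordlist, upper)]
# ['abcd', 'abce', 'abcf']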

Neil G
  • No, it doesn't - I did think of that initially. Try it with wordlist=['abcd', 'abce', 'abcf'] and word_fragment='abc' - both bisect_left and bisect_right return 0. – Peter Jul 17 '11 at 21:21
  • @Neil G This doesn't return the slice of **words beginning with word_fragment**. It returns the slice of words equal to **word_fragment**, if there are any in _wordlist_. – eyquem Jul 17 '11 at 21:22

As Peter suggested, I would use the bisect module, especially if you're reading from a large file of words.

If you really need speed, you could make a daemon (How do you create a daemon in Python?) that has a pre-processed data structure suited to the task.

I suggest you use a trie:

http://www.topcoder.com/tc?module=Static&d1=tutorials&d2=usingTries

There are many algorithms and data structures to index and search strings inside a text; some of them are included in the standard libraries, but not all of them. The trie data structure is a good example of one that isn't.

Let word be a single string and let dictionary be a large set of words. If we have a dictionary and we need to know whether a single word is in it, a trie is a data structure that can help us. But you may be asking yourself, "Why use tries if sets and hash tables can do the same?" There are two main reasons:

  • Tries can insert and find strings in O(L) time (where L represents the length of a single word). This is much faster than a set, and a bit faster than a hash table.
  • A set or a hash table can only find words in a dictionary that exactly match the single word we are looking for; a trie allows us to find words that differ by a single character, share a common prefix, have a character missing, etc.

Tries can be useful in TopCoder problems, but they also have a great number of applications in software engineering. For example, consider a web browser. Do you know how the browser can autocomplete your text or show you many possibilities for the text that you could be writing? Yes, with a trie you can do it very fast. Do you know how a spell checker can verify that every word you type is in a dictionary? Again, a trie. You can also use a trie for suggesting corrections for words that are present in the text but not in the dictionary.

An example would be:

start={'a':nodea,'b':nodeb,'c':nodec...}
nodea={'a':nodeaa,'b':nodeab,'c':nodeac...}
nodeb={'a':nodeba,'b':nodebb,'c':nodebc...}
etc..

Then, if you want all the words starting with "ab", you would just traverse start['a']['b'], and that would be all the words you want.

To build it, you could iterate through your wordlist and, for each word, iterate through its characters, adding a new dict where required; a sketch follows below.
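
A minimal sketch of that idea (make_trie and words_with_prefix are names I made up for illustration, not a standard API):

def make_trie(wordlist):
    # Nested dicts: each level maps a character to the next level.
    root = {}
    for word in wordlist:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True   # sentinel: a complete word ends here
    return root

def words_with_prefix(trie, prefix):
    # Walk down to the node for the prefix, if it exists.
    node = trie
    for ch in prefix:
        if ch not in node:
            return []   # no word starts with this prefix
        node = node[ch]
    # Depth-first walk collecting every completion below that node.
    results, stack = [], [(node, prefix)]
    while stack:
        node, sofar = stack.pop()
        for key, child in node.items():
            if key == '$':
                results.append(sofar)
            else:
                stack.append((child, sofar + key))
    return results

trie = make_trie(['ab', 'abc', 'abd', 'b'])
print words_with_prefix(trie, 'ab')   # ['ab', 'abc', 'abd'] (order may vary)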

Rusty Rob

For binary search (assuming wordlist is sorted), I'm thinking of something like this:

wordlist = "ab", "abc", "bc", "bcf", "bct", "cft", "k", "l", "m"
fragment = "bc"
a, m, b = 0, 0, len(wordlist)-1
iterations = 0

while True:
    if (a + b) / 2 == m: break # endless loop = nothing found
    m = (a + b) / 2
    iterations += 1
    if wordlist[m].startswith(fragment): break # found word
    if wordlist[m] > fragment >= wordlist[a]: a, b = a, m
    elif wordlist[b] >= fragment >= wordlist[m]: a, b = m, b

if wordlist[m].startswith(fragment):
    print wordlist[m], iterations
else:
    print "Not found", iterations

It will find one matched word, or none. You will then have to look to the left and right of it to find the other matched words. My algorithm might be incorrect; it's just a rough version of my thoughts.

Andrey Kuzmin
  • This is basically the same as using bisect as Peter suggested, I just didn't know such module existed. Using bisect is a better and cleaner way than my suggestion. – Andrey Kuzmin Jul 17 '11 at 10:48

Here's my fastest way to narrow the list wordlist down to a list of valid words starting with a given fragment:

sect() is a generator function that uses Peter's excellent idea to employ bisect, together with the islice() function:

from bisect import bisect_left
from itertools import islice

from time import clock

A,B = [],[]


iterations = 5
repetition = 30

with open('words.txt') as f:
    wordlist = f.read().split()

wordlist.sort()
print 'wordlist[0:10]==',wordlist[0:10]


def sect(wordlist,word_fragment):
    lgth = len(word_fragment)
    for w in islice(wordlist,bisect_left(wordlist, word_fragment),None):
        if w[0:lgth]==word_fragment:
            yield w
        else:
            break


def hooloo(wordlist,word_fragment):
    usque = len(word_fragment)
    for w in wordlist:
        if w[:usque] > word_fragment:
            break
        if w.startswith(word_fragment):
            yield w


for rep in xrange(repetition):
    te = clock()
    for i in xrange(iterations):
        newlistA = list(sect(wordlist,'VEST'))
    A.append(clock()-te)

    te = clock()
    for i in xrange(iterations):
        newlistB = list(hooloo(wordlist,'VEST'))
    B.append(clock() - te)


print '\niterations =',iterations,'   number of tries:',repetition,'\n'
print newlistA,'\n',min(A),'\n'
print newlistB,'\n',min(B),'\n'

Result:

wordlist[0:10]== ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']

iterations = 5    number of tries: 30 

['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES'] 
0.0286089433154 

['VEST', 'VESTA', 'VESTAL', 'VESTALLY', 'VESTALS', 'VESTAS', 'VESTED', 'VESTEE', 'VESTEES', 'VESTIARY', 'VESTIGE', 'VESTIGES', 'VESTIGIA', 'VESTING', 'VESTINGS', 'VESTLESS', 'VESTLIKE', 'VESTMENT', 'VESTRAL', 'VESTRIES', 'VESTRY', 'VESTS', 'VESTURAL', 'VESTURE', 'VESTURED', 'VESTURES'] 
0.415578236899

sect() is 14.5 times faster than hooloo()

PS:

I know timeit exists, but here, for such a result, clock() is fully sufficient.

eyquem

Doing binary search in the list is not going to guarantee you anything. I am not sure how that would work either.

You have a list which is ordered; that is good news. The algorithmic complexity of both your approaches is O(n), which is not bad: you just have to iterate through the whole wordlist once.

But in the second case, the practical (engineering) performance should be better, because you break as soon as you know the remaining entries cannot match. Try a list where the first element is a match and the remaining 38,000 - 1 elements are not; you will see the second beat the first.

Senthil Kumaran
  • Assuming each word in the list has a 1/1000 chance of matching, you expect on average to check at least a few hundred words without binary search. With binary search you expect about log(n) checks (I'm assuming n >> 1000). – Rusty Rob Jul 17 '11 at 10:57