I've done a lot of Googling, but haven't found anything, so I'm really sorry if I'm just searching for the wrong things.
I am writing an implementation of the game Ghost for MIT's Introduction to Programming, assignment 5.
As part of this, I need to determine whether a string of characters is the start of any valid word. I have a list of valid words ("wordlist").
Update: I could use something that iterates through the whole list on each check, such as Peter's simple suggestion:
    def word_exists(wordlist, word_fragment):
        return any(w.startswith(word_fragment) for w in wordlist)
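For example, with a toy list standing in for the real one:

    words = ["pony", "python", "zebra"]
    print(word_exists(words, "pyt"))  # True: "python" starts with "pyt"
    print(word_exists(words, "pyx"))  # False, but only after scanning the whole list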
I previously had:
    wordlist = [w for w in wordlist if w.startswith(word_fragment)]
(from here) to narrow wordlist down to just the valid words that start with the fragment, treating it as a loss if the narrowed list is empty. The reason I took this approach was that I (incorrectly, see below) thought it would save time, as subsequent lookups would only have to search a smaller list.
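To make the intended use concrete, here is a toy sketch of the game loop I had in mind (made-up fragment and word list, not the assignment code):

    wordlist = ["pony", "python", "pythons", "zebra"]  # stand-in for the real list
    fragment = ""
    for letter in "pytho":  # letters as they are played
        fragment += letter
        # Narrow to the survivors so the next check scans a smaller list
        wordlist = [w for w in wordlist if w.startswith(fragment)]
        if not wordlist:
            print(fragment, "is not the start of any valid word - a loss")
            break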
It occurred to me that this goes through every item in the original wordlist (38,000-odd words), checking the start of each. That seems silly when wordlist is ordered; the comprehension could stop once it hits something that sorts after the word fragment. I tried this:
    def matching_words(wordlist, word_fragment):
        newlist = []
        for w in wordlist:
            if w[:len(word_fragment)] > word_fragment:
                # Take advantage of the fact that the list is sorted:
                # nothing after this point can start with the fragment.
                break
            if w.startswith(word_fragment):
                newlist.append(w)
        return newlist
but that was about the same speed, which I thought might be because list comprehensions run as compiled code?
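As an aside, the same early-exit logic can be written as a comprehension over itertools.takewhile (my own sketch, unbenchmarked; since the predicate is still a Python lambda called per word, I would not expect a dramatic speedup):

    from itertools import takewhile

    def matching_words_tw(wordlist, word_fragment):
        # Consume the sorted list only while a word's prefix does not sort
        # after the fragment, then keep the words that actually match it.
        n = len(word_fragment)
        return [w for w in takewhile(lambda w: w[:n] <= word_fragment, wordlist)
                if w.startswith(word_fragment)]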
I then thought that more efficient still would be some form of binary search in the list to find the block of matching words. Is this the way to go, or am I missing something really obvious?
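Concretely, the kind of binary search I have in mind is the following untested sketch, using the standard library's bisect module (the function name is mine):

    from bisect import bisect_left

    def starts_any_word(wordlist, word_fragment):
        # Find where the fragment would be inserted in the sorted list.
        # If any word starts with the fragment, the first such word sits
        # exactly at that insertion point.
        i = bisect_left(wordlist, word_fragment)
        return i < len(wordlist) and wordlist[i].startswith(word_fragment)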
Clearly it isn't really a big deal in this case, but I'm just starting out with programming and want to do things properly.
UPDATE:
I have since tested the suggestions below with a simple test script. While Peter's binary search/bisect would clearly be better for a single run, I was interested in whether the narrowing list would win over a series of fragments. In fact, it did not:
The totals for all strings "p", "py", "pyt", "pyth", "pytho" are as follows (in seconds):
In total, Peter's simple test took 0.175472736359
In total, Peter's bisect left test took 9.36985015869e-05
In total, the list comprehension took 0.0499348640442
In total, Neil G's bisect took 0.000373601913452
The overhead of creating a second list etc. clearly took more time than searching the longer list on every turn. In hindsight, bisecting the full list each time was likely the best approach regardless, as the "reducing list" approach increased the time for the first run, which was the worst-case scenario.
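For reference, the test script was along these lines (a simplified reconstruction rather than the exact script; "words.txt" stands in for the assignment's word file, and only two of the candidates are shown):

    import timeit
    from bisect import bisect_left

    wordlist = sorted(open("words.txt").read().split())

    def simple(frag):
        # Peter's simple suggestion: linear scan with early exit on a hit.
        return any(w.startswith(frag) for w in wordlist)

    def bisected(frag):
        # Peter's bisect_left suggestion: binary search in the sorted list.
        i = bisect_left(wordlist, frag)
        return i < len(wordlist) and wordlist[i].startswith(frag)

    for func in (simple, bisected):
        total = sum(timeit.timeit(lambda: func(frag), number=100)
                    for frag in ("p", "py", "pyt", "pyth", "pytho"))
        print(func.__name__, total)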
Thanks all for some excellent suggestions, and well done Peter for the best answer!!!