1

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?

I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)

Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.

#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]

#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )

print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])

I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.

Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.

Peter
  • 3,186
  • 3
  • 26
  • 59

2 Answers2

1

The first thing that comes to mind is that you can optimize lookup by checking current combination against combinations that are already "invalid". I.e. if ab is invalid, than ab.? will be invalid too and there's no point to check such.

And one more thing: try using

for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)

instead of

invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)

I'm not sure, but less memory allocations can be a small boost for real data run.

Slam
  • 8,112
  • 1
  • 36
  • 44
  • why do you think there are fewer memory allocations? – acushner Feb 03 '15 at 18:12
  • Ahh, thanks for the idea but the speed is still very similar, both result in around 7.3-7.8 seconds when using the 3 levels of combinations. As to your first point, I thought about that, but adding code to stop such checks could potentially slow it down more than it speeds it up, and was a bit hard to work into my 1 line way :P – Peter Feb 03 '15 at 19:10
  • Peter, I wouldn't add checks. I would look at different data structures that would eliminate the need to check. – Ryan O'Donnell Feb 03 '15 at 19:30
0

Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations (with some exceptions. If your word doesn't have "a" it's not going to have any other combinations that contain "a". You can use some tree-like data structure for this) to compare with your original set of words.

You shouldn't convert your wordlist to a string, but rather a set. You should get O(N) where N is the length of your combinations.

Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

Ryan O'Donnell
  • 617
  • 6
  • 14
  • Thanks, I'm unsure as to how to iterate through it as a set though. I tried `set( i for i in allCombinations if i in set( k for k in allWords) )` if that's what you meant, but it's still executing after a few minutes, which is way slower than the 7ish seconds from the other method. – Peter Feb 03 '15 at 19:20
  • Try: > set(allwords) I would still highly suggest you use some different data structure for your allCombinations. Binary search is O(log N) which is really really awesome! – Ryan O'Donnell Feb 03 '15 at 19:27
  • Sorry, what you're suggesting is a little more advanced than my current python knowledge haha, do you mean doing it like `allCombinations["a"]["a"]["b"] = None` or something, where it'll recursively search inwards until a match isn't found? Here's the code I have currently for coming up with the list - http://pastebin.com/cN6mjNiQ – Peter Feb 03 '15 at 20:21
  • It is definitely more complicated! I would first try another language. Prototype it in Go (or even try pypy!) – Ryan O'Donnell Feb 04 '15 at 12:18
  • Ah ok thanks, to be fair this is more of a personal idea I'm working on so it's not vital I squeeze out every last ounce of speed possible, I'm just trying to get more advanced with Python as it's the language maya supports :) – Peter Feb 04 '15 at 14:49
  • I'm slightly better at thinking logically than with code haha, so I always tend to try the maths way around things, which quite often is far from the fastest ;p – Peter Feb 04 '15 at 14:51