4

I have a dictionary of 200,000 words and a set of letters. I need an algorithm to check if all the letters of a word are in that set of letters. It's very slow to check the words one by one. Because there is a huge number of words to process, I need a data structure to do this. Any ideas? Thanks!

For example: I have a set of letters {b,g,e,f,t,u,i,t,g,n,c,m,m,w,c,s}, I wanna check if word "big" and "buff". All letters of "big" are a subset of the original set then "big" is what i want while "buff" is not what i want because there is only one "f" in the original set. This is what i wanna do.

AaronTry
  • 53
  • 5
  • Why not utilize multithreading? –  Jun 21 '13 at 02:01
  • 3
    a trie? http://en.wikipedia.org/wiki/Trie – nachokk Jun 21 '13 at 02:03
  • This look like a case for a Trie data structure (as @nachokk suggests). – Luiggi Mendoza Jun 21 '13 at 02:06
  • 1
    @nachokk- That certainly helps, but it's tricky to use it correctly because you might have to test for all permutations of the letters you have. – templatetypedef Jun 21 '13 at 02:09
  • Can you elaborate a bit more on your setup? Do you know anything about the number / distribution of letters or how you're going to use them? – templatetypedef Jun 21 '13 at 02:10
  • @templatetypedef You don't have to account for all permutations. See my answer. – paddy Jun 21 '13 at 02:13
  • Hey Aaron, I just saw your edit. Seems like you're doing the problem in reverse. I thought you wanted to find all possible words that can be made from a set of letters, but now it seems you want to test if individual words can be made from a set of letters. If that's what you want, then the problem is so much easier than the solution I described. – paddy Jun 21 '13 at 02:40

3 Answers3

7

This is for something like Scrabble or Boggle, right? Well, what you do is pre-generate your dictionary by sorting the letters in each word. So, word becomes dorw. Then you shove all these into a Trie data structure. So, in your Trie, the sequence dorw would point to the value word.

[Note that because we sorted the words, they lose their uniqueness, so one sorted word can point to multiple different words. ie your Trie needs to store a list or array at its data nodes]

You can save this structure out if you need to load it quickly later without all the sorting steps.

What you then do is take your input letters and you sort them too. You then start walking through your Trie recursively. If the current letter matches an existing path in the Trie, you follow it. Because you can have unused letter, you also allow the current letter to be dropped.

And it's that simple. Any time you encounter a node in your Trie that has a value, that's a word that you can make out of the letters you used to get there. You just add these words to a list as you find them, and when the recursion is done you have found every possible word.

If you have repeated letters in your input, you may need extra logic to prevent multiple instances of the same word being given (unless you want that). That logic can be invoked during the step that 'leaves out' a letter (you just skip past all the repeated letters) to the next letter.


[edit] You seem to want to do the opposite. My solution above finds all possible words that can be made from a set of letters. But you want to test a specific word to see if it's allowed, given the set of letters you have.

This is simple.

Store your available letters as a histogram. That is, for each letter, you store the number that you have. Then, you walk through each letter in your test word, building a new histogram as you go. As soon as one of your histogram buckets exceeds the value in your available-letters, the word cannot be made. If you get all the way to the end, you can successfully make the word.

paddy
  • 60,864
  • 6
  • 61
  • 103
  • This reduces searching through permutations to searching through subsets. That's definitely faster, though I suspect there's a better way to do this. – templatetypedef Jun 21 '13 at 02:14
  • @paddy- Isn't this 2^n, at least? For each character, you have to decide whether to use that character (and descend down one branch) or skip that character (and descend down another). This gives two independent choices for each of the n letters, so I thought it would work out to 2^n total options. Or am I misinterpreting your answer? – templatetypedef Jun 21 '13 at 02:19
  • 1
    Hmmm, oh yes I think you're right, and I need coffee! =) Still, the sparseness of the Trie should make performance much better. I remember solving this problem ages ago and can't remember where. Maybe it was in MatLab forums. I should track that down and see if I'm describing what I actually did. – paddy Jun 21 '13 at 02:21
  • I found that question I was thinking of, right here in fact: http://stackoverflow.com/a/14361817/1553090 ... It's not quite the same problem as this, however. – paddy Jun 21 '13 at 02:30
  • Yeah, it's like scrabble but I wanna check if the word belongs to that set, if so print out that word. I don't know why i need to pre-generate each word. – AaronTry Jun 21 '13 at 02:39
  • @AaronTry I have edited my answer to show the reverse approach. – paddy Jun 21 '13 at 02:46
  • @paddy After I store each word in a histogram, how can I compare two histogram in java? Is there a method in java that I can use? – AaronTry Jun 27 '13 at 00:13
  • You don't need to compare the histograms. You just build it and throw it away after. While building, if at any stage one letter count exceeds the count for the same letter in your source word's histogram, then the word cannot be made. – paddy Jun 27 '13 at 01:16
0

You can use an array to mark the letter set. Each element in the array stands for a letter. To convert the letter to the element position, just need to subtract the ASCII code of 'a' or 'A'. Then the first element stands for 'a', then the second is 'b', and so on. Then the 27th is 'A'. The element value stands for the occurrences. For example, the array {2, 0, 1, 0, ...} stands for like {a, c, a}. The pseudo code could be:

    for each word
        copy the array to a new one
        for each letter in the word
            get the element position of the letter: position = letter - 'a'
            decrease the element value in the new array by one: new_array[position]--
            if the value is negative, return not found: if array[position] < 0  {return not found;}
Pu Gong
  • 176
  • 3
  • Assume the average length of a word is 7 and the number of letters in the set is 18.It takes average 7*log(18) to compare a word if I convert 18 letters to numbers and put them in a binary search tree. 200,000 word will cost a lot of time. – AaronTry Jun 21 '13 at 03:18
  • Sorry, I didn't understand the log(18). I mean the array is fixed and always 52 length if you are assuming only English letters. To get the element value of a letter, just array[letter - 'a']. This will be very fast because it is not a sort of search. – Pu Gong Jun 21 '13 at 05:59
  • log(n) is the search time for BST. – AaronTry Jun 23 '13 at 06:11
0

sort the set, then sort each word and do a "merge"-like operation

necromancer
  • 23,916
  • 22
  • 68
  • 115