5

Let's say that we have a dictionary of about 250.000 words. Algorithm should take in 12 letters as an array or a string and find the variation that matches longest word from a dictionary.

Of course, one can always brute-force it, but I wonder what would be the most elegant way to do this?

Answers using languages other than PHP will also be accepted if they do not use any language-specific functions as a shortcut for the main problem.

Note: Words are stored in the database, but I could pull them into memory for speed. Although I'm not sure PHP's indexing is better than that of an MySQL database?

Milan Babuškov
  • 59,775
  • 49
  • 126
  • 179

5 Answers5

4

You should calculate the signature of every word, you do it only once and save it into your database along with the word.

The table should be something like this:

   word varchar(12), 
   a int,
   b int, 
   c int,
    ...
   w int,
   z int;

and the fields from a to z have to contains the number of letter contained in the word, for example anagram would have a record like:

word,    a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
anagram, 3,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0

once you have the twelve letters you have to calculate the signature of the set and use it to create a select like this:

select word, length(word) as wordlen
from dictionary
where
a <= 4 and
b <= 0 and
c <= 1 and
d <= 2 and
e <= 0 and
f <= 0 and
 ....
z <= 0
order by wordlen desc;

in order to have all the word that can be created using the letter set you have.

No permutation, no combination and the though work (compiling the dictionary) is done only once and offline.

Just another hint, strip from the database all the words that are longer than twelve chars

Milan Babuškov
  • 59,775
  • 49
  • 126
  • 179
Eineki
  • 14,773
  • 6
  • 50
  • 59
  • I like this, although it seems to require to scan through entire set when searching for longest word. I was hoping for an algorithm that would allow me to use indexes. – Milan Babuškov Sep 28 '09 at 13:59
  • You can index each of the a-z fields. Though that may take up a lot of space. – DisgruntledGoat Sep 28 '09 at 16:33
  • Can you smash the characters used with an amount into another column (called "hash") and index that. For "anagram" you would have a hash column of "a3g1m1n1r1" (or you could do the "aaagmnr" hash as well) – null Sep 28 '09 at 18:30
  • If you use the signature/hash of a word (aaegmnr for manager) you can't retrieve easily the subwords like, for example, game o gamer with just a sql query – Eineki Sep 28 '09 at 18:41
  • Okay, I see... it's why you have the "... a <= 4 ..." part of the query. So if that's the case, then could you just hash the characters available, so that way you can index that? So the hash would be... "agmnr"? – null Sep 28 '09 at 20:01
3

I'd go with a slightly modified version of the answer to the anagram question here

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."

Start with your complete input, alphabetically sorted. If its not found, remove one letter, do the search again. Do this for every letter. Then remove two letters... and so on.

Worst case: No 'anagram' found at all. You will have to test all possible input combinations, which will give you around 2^n lookups where n is the number of input characters (in your example: 12) However, the speed of the algorithm does not depend on the size of the dictionary at run time (of course, sorting the words alphabetically does) which in my opinion is the most important thing here.

Community
  • 1
  • 1
HerdplattenToni
  • 486
  • 2
  • 5
1

Eric Lippert has written an informative blog post about anagram searching. The examples all use c#, but the techniques are usable in any language.

The trick to efficiently searching for anagrams in a dictionary is to realize that all anagrams have the same letters, just in different order. If you "canonicalize" each word so that its letters are uppercase and in alphabetical order, then checking whether one word is an anagram of another is as simple as comparing their canonical forms

With this technique, you can easily look up anagrams from a hash table or balanced tree.

Sean Reilly
  • 21,526
  • 4
  • 48
  • 62
0

If you are trying to find the longest matching word, I would start by trying to sort the dictionary by word length, so you can focus the most effort on the longest words

Rik Heywood
  • 13,816
  • 9
  • 61
  • 81
-1

My idea:

pseudocode:

int_32 letter_mask
int_32 permutation_match_mask
if(((letter_mask XOR permutation_match_mask) AND letter_mask)  == 0)
        YOU_HAVE_HIT;

well this works when you have non repetive letters in lettermask, but if you have more letters (as you probably have) than you can extend leter and permutationmatchmask

EDIT

Another idea

Sort words in vocabulary by alphabeticaly order.

if there are 12 letteres and all of them are different, than there are exactly 4095 posible cobinations (just sum i= 1->12 binomial(12 over i) ) (for letters ABCD, there are (ABCD,ABC,ABD,ACD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D) And as I said there are 4095 in 12 different letters and even less if some of letters are same.

Complexity 4095*Log2(250000) what is aproximetly 75000. Well it is worth to try.

Go for exact search on each combination.

Luka Rahne
  • 10,336
  • 3
  • 34
  • 56
  • Care to explain in little more detail? – Milan Babuškov Sep 28 '09 at 11:01
  • It is brute force algorhitm and need to check each word separatly to find HIT. For instance: You have letters "ABFR" and 2 words in dictionary "FOO" and "BAR" bar is represented whit "11000000000000000100000000000000"binary (we have A at 1, B at 2 and R at 18) "arbf" has "11000100000000000100000000000000"binary when you do upper logical evaluantion which need in this case just few instructions it yeilds YOU_HAVE_HIT But as I told it is just fast brute force implementation. – Luka Rahne Sep 28 '09 at 11:17