Algorithm to find longest anagram

Question

Let's say that we have a dictionary of about 250.000 words. Algorithm should take in 12 letters as an array or a string and find the variation that matches longest word from a dictionary.

Of course, one can always brute-force it, but I wonder what would be the most elegant way to do this?

Answers using languages other than PHP will also be accepted if they do not use any language-specific functions as a shortcut for the main problem.

Note: Words are stored in the database, but I could pull them into memory for speed. Although I'm not sure PHP's indexing is better than that of an MySQL database?

You should read up on tries. http://en.wikipedia.org/wiki/Trie — FogleBird, Sep 29 '09 at 03:06

score 4 · Answer 1 · edited Sep 28 '09 at 21:08

4

You should calculate the signature of every word, you do it only once and save it into your database along with the word.

The table should be something like this:

   word varchar(12), 
   a int,
   b int, 
   c int,
    ...
   w int,
   z int;

and the fields from a to z have to contains the number of letter contained in the word, for example anagram would have a record like:

word,    a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
anagram, 3,0,0,0,0,0,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0

once you have the twelve letters you have to calculate the signature of the set and use it to create a select like this:

select word, length(word) as wordlen
from dictionary
where
a <= 4 and
b <= 0 and
c <= 1 and
d <= 2 and
e <= 0 and
f <= 0 and
 ....
z <= 0
order by wordlen desc;

in order to have all the word that can be created using the letter set you have.

No permutation, no combination and the though work (compiling the dictionary) is done only once and offline.

Just another hint, strip from the database all the words that are longer than twelve chars

edited Sep 28 '09 at 21:08

Milan Babuškov

59,775
49
126
179

answered Sep 28 '09 at 12:37

Eineki

14,773
6
50
59

I like this, although it seems to require to scan through entire set when searching for longest word. I was hoping for an algorithm that would allow me to use indexes. – Milan Babuškov Sep 28 '09 at 13:59
You can index each of the a-z fields. Though that may take up a lot of space. – DisgruntledGoat Sep 28 '09 at 16:33
Can you smash the characters used with an amount into another column (called "hash") and index that. For "anagram" you would have a hash column of "a3g1m1n1r1" (or you could do the "aaagmnr" hash as well) – null Sep 28 '09 at 18:30
If you use the signature/hash of a word (aaegmnr for manager) you can't retrieve easily the subwords like, for example, game o gamer with just a sql query – Eineki Sep 28 '09 at 18:41
Okay, I see... it's why you have the "... a <= 4 ..." part of the query. So if that's the case, then could you just hash the characters available, so that way you can index that? So the hash would be... "agmnr"? – null Sep 28 '09 at 20:01

score 3 · Accepted Answer · edited May 23 '17 at 12:19

3

I'd go with a slightly modified version of the answer to the anagram question here

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."

Start with your complete input, alphabetically sorted. If its not found, remove one letter, do the search again. Do this for every letter. Then remove two letters... and so on.

Worst case: No 'anagram' found at all. You will have to test all possible input combinations, which will give you around 2^n lookups where n is the number of input characters (in your example: 12) However, the speed of the algorithm does not depend on the size of the dictionary at run time (of course, sorting the words alphabetically does) which in my opinion is the most important thing here.

edited May 23 '17 at 12:19

Community

1
1

answered Sep 28 '09 at 12:17

HerdplattenToni

486
2
5

Presumably a typo, but just to check: did you *mean* 'acfoor' or **abfoor** ? – David Thomas Sep 28 '09 at 18:28
I actually copied that part from the original answer, but you are right of course – HerdplattenToni Sep 29 '09 at 07:59

score 1 · Answer 3 · answered Sep 29 '09 at 15:16

1

Eric Lippert has written an informative blog post about anagram searching. The examples all use c#, but the techniques are usable in any language.

The trick to efficiently searching for anagrams in a dictionary is to realize that all anagrams have the same letters, just in different order. If you "canonicalize" each word so that its letters are uppercase and in alphabetical order, then checking whether one word is an anagram of another is as simple as comparing their canonical forms

With this technique, you can easily look up anagrams from a hash table or balanced tree.

answered Sep 29 '09 at 15:16

Sean Reilly

21,526
4
48
62

Isn't that the same as HerdplattenToni's answer? – Milan Babuškov Sep 29 '09 at 18:10
It's certainly similar, but the blog post includes lots of practical tuning advice, and it's worth mentioning. – Sean Reilly Sep 29 '09 at 21:54

score 0 · Answer 4 · answered Sep 28 '09 at 10:22

0

If you are trying to find the longest matching word, I would start by trying to sort the dictionary by word length, so you can focus the most effort on the longest words

answered Sep 28 '09 at 10:22

Rik Heywood

13,816
9
61
81

Luka Rahne · Answer 5 · 2009-09-30T00:26:36.363

-1

My idea:

pseudocode:

int_32 letter_mask
int_32 permutation_match_mask
if(((letter_mask XOR permutation_match_mask) AND letter_mask)  == 0)
        YOU_HAVE_HIT;

well this works when you have non repetive letters in lettermask, but if you have more letters (as you probably have) than you can extend leter and permutationmatchmask

EDIT

Another idea

Sort words in vocabulary by alphabeticaly order.

if there are 12 letteres and all of them are different, than there are exactly 4095 posible cobinations (just sum i= 1->12 binomial(12 over i) ) (for letters ABCD, there are (ABCD,ABC,ABD,ACD,BCD,AB,AC,AD,BC,BD,CD,A,B,C,D) And as I said there are 4095 in 12 different letters and even less if some of letters are same.

Complexity 4095*Log2(250000) what is aproximetly 75000. Well it is worth to try.

Go for exact search on each combination.

edited Sep 30 '09 at 00:26

answered Sep 28 '09 at 10:55

Luka Rahne

10,336
3
34
56

Care to explain in little more detail? – Milan Babuškov Sep 28 '09 at 11:01
It is brute force algorhitm and need to check each word separatly to find HIT. For instance: You have letters "ABFR" and 2 words in dictionary "FOO" and "BAR" bar is represented whit "11000000000000000100000000000000"binary (we have A at 1, B at 2 and R at 18) "arbf" has "11000100000000000100000000000000"binary when you do upper logical evaluantion which need in this case just few instructions it yeilds YOU_HAVE_HIT But as I told it is just fast brute force implementation. – Luka Rahne Sep 28 '09 at 11:17

Algorithm to find longest anagram

5 Answers5