What is the most efficient algorithm to find anagrams from a dictionary?

Question

I want to create a hash: keys will be first met different words, the value is a reference to anon array with anagrams of this key.

I wrote the following code, which works when the amount of words is in range (0 to a few thousand) but I want to make it faster.

Below is my way to check if a string is an anagram. I iterate for all words and for all keys of the hash.

join("", sort(split (//, fc($word)))) eq join("", sort(split (//, fc($key)))))

The obvious enhancement is to perform `my $key = join( "", sort( split (//, fc($word) ) ) )` only once. That should make a big difference. On top of that you can prefilter the words in the dictionary to select only those words that (say) begin with two of the letters in the original word. — Borodin, Mar 25 '18 at 14:28
Finding anagrams is a variation on the "N-letter" problem (what N letters spell the most other words of N or fewer letters). One of the most efficient solutions is to create a bit histogram where the first 32bit word represents a-z, the second layer represents duplicates of a-z... you seldom have to look past the 4th layer (which would indicate the same letter appears 4 times). By converting every word in the dictionary to its bit histogram you can seek matches using integer & comparisons, and reject at the earliest possible layer. — DavidO, Mar 25 '18 at 15:43
Apple becomes 10001000000100000000000000000000, 00000000000000010000000000000000. All words with one a, one e, one l, and two p`s will have the same pattern. — DavidO, Mar 25 '18 at 16:16
Maybe you would prefer a straightforward and easy t reason about approach that is reasonably performant instead of the most performant but harder to reason about. — DavidO, Mar 25 '18 at 16:19
@DavidO: Using a string formed from the sorted letters in the original word will provide a similar key. The OP has already done this, and it produces shorter keys than your proposal without any disadvantages that I can see. Perhaps if the requirement were to find words that use a *subset* of the original word then bitmaps would be more useful, but that is not the OP's requirement. — Borodin, Mar 25 '18 at 16:42
Possible duplicate of [Algorithm to generate anagrams](https://stackoverflow.com/questions/55210/algorithm-to-generate-anagrams) — Gilles 'SO- stop being evil', Mar 25 '18 at 17:12

ikegami · Answer 1 · 2018-03-26T02:48:00.933

1

O(N).

my %grouped;
while (<>) {
   chomp;
   my $key = sort split //, fc($_);
   push @{ $grouped{$key} }, $_;
}

my @anagrams = grep { @$_ >= 2 } values(%grouped);

edited Mar 26 '18 at 02:48

answered Mar 25 '18 at 20:51

ikegami

367,544
15
269
518

also suits function sort which goes over dictionary and have block in where we can fill our hash. It works very fast. I'm surprised. – D.123perl456 Mar 25 '18 at 21:46
1

I think you mean to do: `my $key = join '', sort split //, CORE::fc($_); ` – xxfelixxx Mar 26 '18 at 01:59
`fc` vs `CORE::fc` depending on if your perl version is >= 5.16 – xxfelixxx Mar 26 '18 at 02:00
1

@xxfelixxx, `CORE::fc` or `use feature qw( fc );`, your pick – ikegami Mar 26 '18 at 02:38

What is the most efficient algorithm to find anagrams from a dictionary?

1 Answers1