2

I want to create a hash: keys will be first met different words, the value is a reference to anon array with anagrams of this key.

I wrote the following code, which works when the amount of words is in range (0 to a few thousand) but I want to make it faster.

Below is my way to check if a string is an anagram. I iterate for all words and for all keys of the hash.

join("", sort(split (//, fc($word)))) eq join("", sort(split (//, fc($key)))))
Dave Cross
  • 68,119
  • 3
  • 51
  • 97
  • The obvious enhancement is to perform `my $key = join( "", sort( split (//, fc($word) ) ) )` only once. That should make a big difference. On top of that you can prefilter the words in the dictionary to select only those words that (say) begin with two of the letters in the original word. – Borodin Mar 25 '18 at 14:28
  • Finding anagrams is a variation on the "N-letter" problem (what N letters spell the most other words of N or fewer letters). One of the most efficient solutions is to create a bit histogram where the first 32bit word represents a-z, the second layer represents duplicates of a-z... you seldom have to look past the 4th layer (which would indicate the same letter appears 4 times). By converting every word in the dictionary to its bit histogram you can seek matches using integer & comparisons, and reject at the earliest possible layer. – DavidO Mar 25 '18 at 15:43
  • can you give an example how it should look? – D.123perl456 Mar 25 '18 at 16:02
  • Apple becomes 10001000000100000000000000000000, 00000000000000010000000000000000. All words with one a, one e, one l, and two p`s will have the same pattern. – DavidO Mar 25 '18 at 16:16
  • Maybe you would prefer a straightforward and easy t reason about approach that is reasonably performant instead of the most performant but harder to reason about. – DavidO Mar 25 '18 at 16:19
  • where I can read about it? – D.123perl456 Mar 25 '18 at 16:23
  • @DavidO: Using a string formed from the sorted letters in the original word will provide a similar key. The OP has already done this, and it produces shorter keys than your proposal without any disadvantages that I can see. Perhaps if the requirement were to find words that use a *subset* of the original word then bitmaps would be more useful, but that is not the OP's requirement. – Borodin Mar 25 '18 at 16:42
  • 1
    Possible duplicate of [Algorithm to generate anagrams](https://stackoverflow.com/questions/55210/algorithm-to-generate-anagrams) – Gilles 'SO- stop being evil' Mar 25 '18 at 17:12

1 Answers1

1

O(N).

my %grouped;
while (<>) {
   chomp;
   my $key = sort split //, fc($_);
   push @{ $grouped{$key} }, $_;
}

my @anagrams = grep { @$_ >= 2 } values(%grouped);
ikegami
  • 367,544
  • 15
  • 269
  • 518