56

What would be the best strategy to generate anagrams?

An anagram is a type of word play: the result of rearranging the letters
of a word or phrase to produce a new word or phrase, using all the original
letters exactly once.
For example:
  • Eleven plus two is an anagram of Twelve plus one
  • A decimal point is an anagram of I'm a dot in place
  • Astronomers is an anagram of Moon starers

At first it looks straightforward: just jumble the letters and generate all possible combinations. But what would be an efficient approach that generates only words found in a dictionary?

I came across this page, Solving anagrams in Ruby.

But what are your ideas?

prakash
  • 2
    *settles back in anticipation*..! If you need the output to be a clue for the original phrase, I don't really see how you could 'generate' it. Surely all you can do is generate a list of phrases/anagram pairings and pick from them? How could an algorithm understand astronomers=moon starers, eg? – robsoft Sep 10 '08 at 20:23
  • 1
    Of course generating GOOD anagrams is a hard problem, but generating bad anagrams is easier :) – Vinko Vrsalovic Sep 10 '08 at 20:30

14 Answers

47

Most of these answers are horribly inefficient and/or will only give one-word solutions (no spaces). My solution will handle any number of words and is very efficient.

What you want is a trie data structure. Here's a complete Python implementation. You just need a word list saved in a file named words.txt. You can try the Scrabble dictionary word list here:

http://www.isc.ro/lists/twl06.zip

MIN_WORD_SIZE = 4  # min size of a word in the output

class Node(object):
    def __init__(self, letter='', final=False, depth=0):
        self.letter = letter
        self.final = final  # True if a dictionary word ends at this node
        self.depth = depth  # number of letters between the root and this node
        self.children = {}
    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, depth=index + 1)
            node = node.children[letter]
        # mark the end of the word even when the node already exists,
        # e.g. when "foob" is added after "foobar"
        if letters:
            node.final = True
    def anagram(self, letters):
        # count how many of each letter (tile) is available
        tiles = {}
        for letter in letters:
            tiles[letter] = tiles.get(letter, 0) + 1
        min_length = len(letters)
        return self._anagram(tiles, [], self, min_length)
    def _anagram(self, tiles, path, root, min_length):
        if self.final and self.depth >= MIN_WORD_SIZE:
            word = ''.join(path)
            length = len(word.replace(' ', ''))
            if length >= min_length:
                yield word
            # a complete word ends here: restart from the root for the next word
            path.append(' ')
            for word in root._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
        for letter, node in self.children.items():
            count = tiles.get(letter, 0)
            if count == 0:
                continue
            # use one tile, recurse, then put the tile back
            tiles[letter] = count - 1
            path.append(letter)
            for word in node._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
            tiles[letter] = count

def load_dictionary(path):
    result = Node()
    with open(path, 'r') as word_file:
        for line in word_file:
            word = line.strip().lower()
            result.add(word)
    return result

def main():
    print('Loading word list.')
    words = load_dictionary('words.txt')
    while True:
        letters = input('Enter letters: ')
        letters = letters.lower()
        letters = letters.replace(' ', '')
        if not letters:
            break
        count = 0
        for word in words.anagram(letters):
            print(word)
            count += 1
        print('%d results.' % count)

if __name__ == '__main__':
    main()

When you run the program, the words are loaded into a trie in memory. After that, just type in the letters you want to search with and it will print the results. It will only show results that use all of the input letters, nothing shorter.

It filters short words from the output, otherwise the number of results is huge. Feel free to tweak the MIN_WORD_SIZE setting. Keep in mind, just using "astronomers" as input gives 233,549 results if MIN_WORD_SIZE is 1. Perhaps you can find a shorter word list that only contains more common English words.

Also, the contraction "I'm" (from one of your examples) won't show up in the results unless you add "im" to the dictionary and set MIN_WORD_SIZE to 2.

The trick to getting multiple words is to jump back to the root node in the trie whenever you encounter a complete word in the search. Then you keep traversing the trie until all letters have been used.

Sam Mussmann
FogleBird
  • Only 16818 anagrams for astronomers with word length 1 from my anagram program, as it does not give out permutations. Running time is around 2 s to produce the results on my humble AMD Sempron computer. I save the results to a file, which is more useful than a flood of words on the text console. I do not use tree structures, just plain text with recursion, matching keys from a dictionary hashed by sorted-letter keys. – Tony Veijalainen Feb 22 '11 at 19:29
  • I have posted the code I mentioned previously on DaniWeb at http://www.daniweb.com/software-development/python/code/393153/multiword-anagrams-by-recursive-generator. – Tony Veijalainen May 18 '12 at 08:40
  • 2
    Bug report: If the wordlist has two entries: "foobar" and "foob" (in that order), then the code snippet won't find an anagram for "boof". Only if you reverse the order of the wordlist, then it correctly returns "foob". I think this can be fixed by putting another `if` clause into the very first `for` loop, but I'll have to leave that to someone who knows Python. – Martin J.H. Nov 12 '13 at 15:35
  • Could you describe your algorithm in a couple of sentences? What I am particularly interested in is what happens after you decide that a certain dictionary word can be composed using some letters of the input. I understand that we then check to see if the remaining characters can be used to compose some other word. How do we know that we have exhausted all possibilities? – MadPhysicist Jul 24 '16 at 23:20
  • @MadPhysicist the trie structure allows you to take particular advantage of how in English a lot of words are the same but with different endings. So if your input letters for the anagram contain "q", "u" but not "i" then with just 3 moves we can eliminate "quick", "quickly", "quicker", "quicken", etc... So it's a structure which groups words into subsets of each other in a practical way. I suspect there's another data structure which also allows you to eliminate all words with the letter "i" and doesn't care about letter order but not sure how to keep the size of it tractable. – Adamantish Sep 11 '19 at 10:30
  • @MadPhysicist The first thing it does is find all words that could be inside the full anagram sentence. It does that by testing all children from the top node then uses recursion to follow in-turn all the grandchildren of those children that passed, etc... That exhausts all possibilities for the first word. For each of those possible first words it then repeats the whole process to get another word with the remaining letters and round again for each of those until all letters are used up just right. As you can imagine, even this can't work well with a long input string. – Adamantish Sep 11 '19 at 18:46
20

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor".

Then when the input anagram comes in, sort its letters too, then look it up. It's as fast as a hashtable lookup!
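A minimal Python sketch of this lookup, assuming the word list is already in memory. The names `signature`, `build_index` and `lookup` are illustrative, not from any library:

```python
from collections import defaultdict

def signature(word):
    """Letters of `word` in alphabetical order, e.g. "foobar" -> "abfoor"."""
    return ''.join(sorted(word.lower()))

def build_index(words):
    """Map each sorted-letter signature to the dictionary words that share it."""
    index = defaultdict(list)
    for word in words:
        index[signature(word)].append(word)
    return index

def lookup(index, letters):
    """Return every dictionary word that is an anagram of `letters`."""
    return index.get(signature(letters), [])

index = build_index(["pots", "stop", "tops", "foobar"])
print(lookup(index, "opts"))  # ['pots', 'stop', 'tops']
```

Building the index costs one sort per dictionary word; after that each query is one sort of the input plus one hash lookup.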

For multiple words, you could do combinations of the sorted letters, sorting as you go. Still much faster than generating all combinations.

(see comments for more optimizations and details)

gnur
Jason Cohen
  • It seems like this (along with Jimmy's answer) would only work for a single word anagram -- how can this be applied to anagrams of phrases? – Serafina Brocious Sep 10 '08 at 20:41
  • As I said in the post, for multiple words you could examine all pairs, triples, etc., in each case combining the letters and sorting (and use mergesort so that op is faster!) and test against that combo. You could be smarter still, e.g. bitfields of chars used at all and... – Jason Cohen Sep 10 '08 at 21:57
  • ...obviously the total number of characters, so when I say "test all triples" there are massive categories you can prune. For example, store words first by length, then hash. So in your pairs/triples you're already able to skip combos with wrong numbers of characters. – Jason Cohen Sep 10 '08 at 21:58
  • sorting the characters first doesn't help at all. It might give you one or two, but you need to test all combinations and then reject them. One way would be to generate all possible triplets and then compare them to the first three letters of all words from a dictionary. – Mats Fredriksson Sep 10 '08 at 22:06
  • 1
    sorting does help -- it's the simplest way (aside from say, using .NET HashSet or Python set()) to map an ordered list of letters to an unordered list. – Jimmy Sep 10 '08 at 22:10
  • 1
    ok, fair enough, it speeds up things in that the anagrams of "foobar" and "barfoo" will resolve to the same result set, but if you are going to get all anagrams from just one sentence, then sorting doesn't help you since you need to consider all characters available. – Mats Fredriksson Sep 10 '08 at 22:22
  • it's a matter of reverse lookup, too. Once you have your big list of letters ("astronomers"), you find a list of sorted substrings ("mno" + "aat" + "sors", or "mnoo"+"aerrsst" for example) so you can look it up in the lookup table you gener – Jimmy Sep 10 '08 at 22:39
  • @Jason, bitwise operations won't work because a letter may appear more than once in the String. If you use OR, these duplicate letters won't be counted, and if you use addition, there will be collisions. – Zach Langley Dec 28 '08 at 01:07
  • Inefficient for multiple words. See my answer. – FogleBird Dec 17 '09 at 21:01
  • The big question here is: what do the dictionary keys look like? I would make them sorted strings. And values would be all possible anagrams of the keys. – IgorGanapolsky Feb 22 '12 at 21:02
8

See this assignment from the University of Washington CSE department.

Basically, you have a data structure that just has the counts of each letter in a word (an array works for ASCII, upgrade to a map if you want Unicode support). You can subtract two of these letter sets; if any count goes negative, you know the second word can't be built from the letters of the first, so it can't be part of an anagram of it.
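As a rough illustration of the letter-count idea (not the code from the assignment), here is a Python sketch using `collections.Counter` as the letter set. The function names are made up for this example:

```python
from collections import Counter

def letter_counts(text):
    """Count the letters in `text`, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def remaining_letters(phrase, word):
    """Letters left after taking `word` out of `phrase`, or None if it does not fit."""
    left = letter_counts(phrase)
    left.subtract(letter_counts(word))  # subtract() keeps negative counts
    if any(n < 0 for n in left.values()):
        return None
    return +left  # unary plus drops the zero counts

def is_anagram(a, b):
    """True if `b` uses exactly the letters of `a`."""
    return remaining_letters(a, b) == Counter()

print(sorted(remaining_letters("astronomers", "moon").elements()))
# ['a', 'e', 'r', 'r', 's', 's', 't']
print(is_anagram("Eleven plus two", "Twelve plus one"))  # True
```

Note that `Counter.subtract` is used rather than the `-` operator, because `-` silently discards negative counts and would hide the "does not fit" case.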

hazzen
  • Working with the counts makes it a simple combination problem. You have a map for the search phrase, and match it to combinations of word maps with the same sum of counts. This is an elegant solution. – Osama Al-Maadeed Feb 14 '09 at 00:51
5

Pre-process:

Build a trie with each leaf as a known word, keyed in alphabetical order.

At search time:

Consider the input string as a multiset. Find the first sub-word by traversing the index trie as in a depth-first search. At each branch you can ask, is letter x in the remainder of my input? If you have a good multiset representation, this should be a constant time query (basically).

Once you have the first sub-word, you can keep the remainder multiset and treat it as a new input to find the rest of that anagram (if any exists).

Augment this procedure with memoization for faster look-ups on common remainder multisets.
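To make the memoization step concrete, here is a small Python sketch. It is a simplification under stated assumptions: it scans a plain word list instead of walking a trie, and it only counts ordered phrases rather than listing them. The remainder multiset is represented as a sorted string so it can serve as a cache key; `count_anagram_phrases` is a hypothetical name.

```python
from functools import lru_cache

def count_anagram_phrases(letters, words):
    """Count ordered word sequences from `words` that use every letter exactly once."""
    vocab = tuple(sorted(set(words)))

    def remove(rest, word):
        # Take the letters of `word` out of the sorted string `rest`;
        # return None if some letter is not available.
        chars = list(rest)
        for ch in word:
            if ch in chars:
                chars.remove(ch)
            else:
                return None
        return ''.join(chars)  # still sorted, since removal keeps the order

    @lru_cache(maxsize=None)
    def solve(rest):
        # `rest` is the remainder multiset; identical remainders are solved once.
        if not rest:
            return 1
        total = 0
        for word in vocab:
            left = remove(rest, word)
            if left is not None:
                total += solve(left)
        return total

    return solve(''.join(sorted(letters)))

print(count_anagram_phrases("astronomers", ["moon", "starers", "no", "more", "stars"]))  # 8
```

The 8 results are the 2 orderings of "moon starers" plus the 6 orderings of "more no stars". Different word orders reach the same remainder, which is where the cache pays off.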

This is pretty fast - each trie traversal is guaranteed to give an actual subword, and each traversal takes linear time in the length of the subword (and subwords are usually pretty darn small, by coding standards). However, if you really want something even faster, you could include all n-grams in your pre-process, where an n-gram is any string of n words in a row. Of course, if W = #words, then you'll jump from index size O(W) to O(W^n). Maybe n = 2 is realistic, depending on the size of your dictionary.

Tyler
3

So here's a working solution in Java, following Jason Cohen's suggestion; it performs somewhat better than the trie-based one. The main points:

  • Load only the dictionary words whose letters are a subset of the letters in the given phrase
  • The dictionary is a hash with the sorted word as key and the set of actual words as value (as suggested by Jason)
  • Iterate over each dictionary key and do a recursive forward lookup to see whether a valid anagram can be found starting with that key
  • Only a forward lookup is needed, because anagrams involving keys that have already been traversed will already have been found
  • Merge the words associated with the keys. For example, if 'enlist' is the word for which anagrams are sought, one set of keys to merge is [ins] and [elt], the actual words for key [ins] are [sin] and [ins], and the word for key [elt] is [let], then the merged word sets are [sin, let] and [ins, let], which become part of the final anagram list
  • Note that this logic lists only unique sets of words, i.e. "eleven plus two" and "two plus eleven" count as the same and only one of them appears in the output

Below is the main recursive code which finds the set of anagram keys:

// recursive function to find all the anagrams for charInventory characters
// starting with the word at dictionaryIndex in dictionary keyList
private Set<Set<String>> findAnagrams(int dictionaryIndex, char[] charInventory, List<String> keyList) {
    // terminating condition if no words are found
    if (dictionaryIndex >= keyList.size() || charInventory.length < minWordSize) {
        return null;
    }

    String searchWord = keyList.get(dictionaryIndex);
    char[] searchWordChars = searchWord.toCharArray();
    // this is where you find the anagrams for whole word
    if (AnagramSolverHelper.isEquivalent(searchWordChars, charInventory)) {
        Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
        Set<String> anagramSet = new HashSet<String>();
        anagramSet.add(searchWord);
        anagramsSet.add(anagramSet);

        return anagramsSet;
    }

    // this is where you find the anagrams with multiple words
    if (AnagramSolverHelper.isSubset(searchWordChars, charInventory)) {
        // update charInventory by removing the characters of the search
        // word as it is subset of characters for the anagram search word
        char[] newCharInventory = AnagramSolverHelper.setDifference(charInventory, searchWordChars);
        if (newCharInventory.length >= minWordSize) {
            Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
            for (int index = dictionaryIndex + 1; index < keyList.size(); index++) {
                Set<Set<String>> searchWordAnagramsKeysSet = findAnagrams(index, newCharInventory, keyList);
                if (searchWordAnagramsKeysSet != null) {
                    Set<Set<String>> mergedSets = mergeWordToSets(searchWord, searchWordAnagramsKeysSet);
                    anagramsSet.addAll(mergedSets);
                }
            }
            return anagramsSet.isEmpty() ? null : anagramsSet;
        }
    }

    // no anagrams found for current word
    return null;
}

You can fork the repo from here and play with it. There are many optimizations that I might have missed. But the code works and does find all the anagrams.

Parth
3

One of the seminal works on programmatic anagrams was by Michael Morton (Mr. Machine Tool), using a tool called Ars Magna. Here is a light article based on his work.

plinth
3

And here is my novel solution.

Jon Bentley’s book Programming Pearls contains a problem about finding anagrams of words. The statement:

Given a dictionary of english words, find all sets of anagrams. For instance, “pots”, “stop” and “tops” are all anagrams of one another because each can be formed by permuting the letters of the others.

I thought a bit and it came to me that the solution would be to obtain the signature of the word you’re searching for and compare it with the signatures of all the words in the dictionary. All anagrams of a word should have the same signature. But how to achieve this? My idea was to use the Fundamental Theorem of Arithmetic:

The fundamental theorem of arithmetic states that

every positive integer (except the number 1) can be represented in exactly one way apart from rearrangement as a product of one or more primes

So the idea is to use an array of the first 26 prime numbers. Then for each letter in the word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list.

The performance is more or less acceptable. For a dictionary of 479828 words, it takes 160 ms to get all anagrams. This is roughly 0.0003 ms / word, or 0.3 microsecond / word. Algorithm’s complexity seems to be O(mn) or ~O(m) where m is the size of the dictionary and n is the length of the input word.

Here’s the code:

package com.vvirlan;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Scanner;

public class Words {
    private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73,
            79, 83, 89, 97, 101, 103, 107, 109, 113 };

    public static void main(String[] args) {
        Scanner s = new Scanner(System.in);
        String word = "hello";
        System.out.println("Please type a word:");
        if (s.hasNext()) {
            word = s.next();
        }
        Words w = new Words();
        w.start(word);
    }

    private void start(String word) {
        measureTime();
        char[] letters = word.toUpperCase().toCharArray();
        long searchProduct = calculateProduct(letters);
        System.out.println(searchProduct);
        try {
            findByProduct(searchProduct);
        } catch (Exception e) {
            e.printStackTrace();
        }
        measureTime();
        System.out.println(matchingWords);
        System.out.println("Total time: " + time);
    }

    private List<String> matchingWords = new ArrayList<>();

    private void findByProduct(long searchProduct) throws IOException {
        File f = new File("/usr/share/dict/words");
        FileReader fr = new FileReader(f);
        BufferedReader br = new BufferedReader(fr);
        String line = null;
        while ((line = br.readLine()) != null) {
            char[] letters = line.toUpperCase().toCharArray();
            long p = calculateProduct(letters);
            if (p == -1) {
                continue;
            }
            if (p == searchProduct) {
                matchingWords.add(line);
            }
        }
        br.close();
    }

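    private long calculateProduct(char[] letters) {
        long result = 1L;
        for (char c : letters) {
            // skip any word containing a character outside A-Z
            if (c < 'A' || c > 'Z') {
                return -1;
            }
            int pos = c - 'A';
            // note: the product can overflow a long for long words
            result *= PRIMES[pos];
        }
        return result;
    }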

    private long time = 0L;

    private void measureTime() {
        long t = new Date().getTime();
        if (time == 0L) {
            time = t;
        } else {
            time = t - time;
        }
    }
}
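For comparison, the same prime-product signature can be sketched in a few lines of Python. This is an illustration rather than a port of the Java above: it groups an in-memory word list into anagram sets (Bentley's original problem) instead of scanning a file, and the function names are invented for the example. Python integers do not overflow, so long words are not a concern here.

```python
from collections import defaultdict

# The first 26 primes, one per letter: a = 2, b = 3, c = 5, ...
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def prime_product(word):
    """Product of the primes for each letter; None if a character is not a-z."""
    product = 1
    for ch in word.lower():
        if not 'a' <= ch <= 'z':
            return None
        product *= PRIMES[ord(ch) - ord('a')]
    return product

def anagram_sets(words):
    """Group words that share a prime product; keep only groups of two or more."""
    groups = defaultdict(list)
    for word in words:
        key = prime_product(word)
        if key is not None:
            groups[key].append(word)
    return [group for group in groups.values() if len(group) > 1]

print(anagram_sets(["pots", "stop", "tops", "hello", "it's"]))
# [['pots', 'stop', 'tops']]
```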
ACV
2

I used the following way of computing anagrams a couple of months ago:

  • Compute a "code" for each word in your dictionary: Create a lookup-table from letters in the alphabet to prime numbers, e.g. starting with ['a', 2] and ending with ['z', 101]. As a pre-processing step compute the code for each word in your dictionary by looking up the prime number for each letter it consists of in the lookup-table and multiply them together. For later lookup create a multimap of codes to words.

  • Compute the code of your input word as outlined above.

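  • Compute inputCode % codeInDictionary for each code in the multimap. If the result is 0, the dictionary word can be built from the letters of the input, and you can look up the matching words; if the two codes are equal, it is a full single-word anagram. Dividing the input code by the word's code gives the code of the leftover letters, which is how this extends to anagrams of two or more words.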
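The divisibility test in the last step can be sketched in Python as follows. This is only an illustration with made-up names (`code`, `words_that_fit`), and it scans a plain list rather than a multimap:

```python
# The first 26 primes, one per letter: a = 2, ..., z = 101
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

def code(word):
    """Multiply the primes of all letters a-z in `word`; other characters are skipped."""
    product = 1
    for ch in word.lower():
        if 'a' <= ch <= 'z':
            product *= PRIMES[ord(ch) - ord('a')]
    return product

def words_that_fit(phrase, words):
    """Return (word, code of the leftover letters) for each word buildable from `phrase`."""
    phrase_code = code(phrase)
    return [(w, phrase_code // code(w)) for w in words if phrase_code % code(w) == 0]

fits = words_that_fit("astronomers", ["moon", "monk", "starers"])
print([word for word, _ in fits])             # ['moon', 'starers']
print(dict(fits)["moon"] == code("starers"))  # True
```

A leftover code of 1 means every letter has been used, so the words chosen so far form a complete anagram.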

Hope that was helpful.

  • 1
    Why such a complicated dictionary... prime numbers, pre-processing, multimap? Just make your dictionary keys to be sorted strings. – IgorGanapolsky Feb 22 '12 at 21:10
  • 1
    See: https://www.scribd.com/document/284697348/A-Fast-Data-Structure-for-Anagrams – Brian Clapper Sep 17 '16 at 17:57
  • @IgorGanapolsky Because by itself that can only give you single word anagrams. The example of "Eleven plus two" wouldn't be possible as an output. – Adamantish Sep 11 '19 at 11:04
2

The book Programming Pearls by Jon Bentley covers this kind of stuff quite nicely. A must-read.

user9282
  • Don't know why you were modded down but Column 2 of Programming Pearls walks through an implementation of a program that finds all sets of anagrams given a dictionary of words. Definitely worth a look. Compile and run the code as follows: ./sign – vinc456 Jan 12 '09 at 15:37
1

A while ago I wrote a blog post about how to quickly find two-word anagrams. It works really fast: finding all 44 two-word anagrams for a word with a text file of more than 300,000 words (4 MB) takes only 0.6 seconds in a Ruby program.

Two Word Anagram Finder Algorithm (in Ruby)

It is possible to make the application faster when it is allowed to preprocess the wordlist into a large hash mapping from words sorted by letters to a list of words using these letters. This preprocessed data can be serialized and used from then on.

martinus
  • I've deleted my previous comment because it was wrong. Anyway: ("az".sum + "by".sum) - "mmnn".sum => 0. That checksum function is not good for anagram solving – nicecatch Mar 31 '17 at 07:19
  • It's not perfect, but very fast. You need to do a final check with any checksum because the possibility of collisions does not go away. – martinus Mar 31 '17 at 08:27
1

How I see it:

You'd want to build a table that maps unordered sets of letters to lists of words, i.e. go through the dictionary so you'd wind up with, say:

lettermap[set(a,e,d,f)] = { "deaf", "fade" }

then from your starting word, you find the set of letters:

 astronomers => (a,e,m,n,o,o,r,r,s,s,t)

then loop through all the partitions of that set ( this might be the most technical part, just generating all the possible partitions), and look up the words for that set of letters.
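A hedged Python sketch of the simplest case, splitting the letters into exactly two parts. The letter map is keyed by the sorted letters of each word (a stand-in for the "unordered set" above), and `two_word_anagrams` is a name invented for this example:

```python
from collections import defaultdict
from itertools import combinations

def two_word_anagrams(phrase, words):
    """Return the set of word pairs from `words` that together use all letters of `phrase`."""
    lettermap = defaultdict(set)
    for word in words:
        lettermap[''.join(sorted(word))].add(word)

    letters = sorted(phrase.replace(' ', '').lower())
    n = len(letters)
    seen = set()
    found = set()
    # Choose which positions go to the first word; the rest go to the second.
    for k in range(1, n // 2 + 1):
        for picked in combinations(range(n), k):
            chosen = set(picked)
            left = ''.join(letters[i] for i in picked)
            right = ''.join(letters[i] for i in range(n) if i not in chosen)
            if (left, right) in seen:
                continue  # repeated letters produce the same split more than once
            seen.add((left, right))
            for a in lettermap.get(left, ()):
                for b in lettermap.get(right, ()):
                    found.add(tuple(sorted((a, b))))
    return found

print(two_word_anagrams("astronomers", ["moon", "starers", "no", "more", "stars"]))
# {('moon', 'starers')}
```

Enumerating splits grows exponentially with the number of letters, so this is only practical for short inputs; splitting into three or more parts would repeat the same step on each remainder.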

edit: hmmm, this is pretty much what Jason Cohen posted.

edit: furthermore, the comments on the question mention generating "good" anagrams, like the examples :). after you build your list of all possible anagrams, run them through WordNet and find ones that are semantically close to the original phrase :)

Jimmy
1

Suppose the dictionary is a hash map, since every word is unique, with a binary (or hex) representation of the word as the key. Then, given a word, I can look it up (and find its meaning) in O(1).

Now, to generate all the valid anagrams, we need to check whether each generated anagram is in the dictionary: if it is present, it's a valid one; otherwise we ignore it.

I will assume that a word can have at most 100 characters (or more, but there is a limit).

So we treat any word as a sequence of indexes: a word like "hello" can be represented as "12345". The anagrams of "12345" are then "12354", "12435", etc.

The only thing we need to do is store all such combinations of indexes for a given number of characters. This is a one-time task. Words can then be generated from the combinations by picking the characters at each index, which gives us the anagrams.

To verify if the anagrams are valid or not, just index into the dictionary and validate.

The only thing that needs handling is duplicates. That can be done easily by comparing each candidate with the ones that have already been looked up in the dictionary.

The solution emphasizes performance.
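A minimal Python sketch of this generate-and-check idea, assuming the dictionary is a `set` of words. It uses `itertools.permutations` instead of precomputed index tables, and a set comprehension takes care of the duplicates that repeated letters would produce:

```python
from itertools import permutations

def brute_force_anagrams(word, dictionary):
    """Generate every ordering of the letters and keep those found in `dictionary`."""
    # A set removes duplicate orderings caused by repeated letters (e.g. the two l's in "hello").
    candidates = {''.join(p) for p in permutations(word)}
    return sorted(candidates & dictionary)

dictionary = {"stop", "tops", "pots", "spot", "post", "opts", "foo"}
print(brute_force_anagrams("pots", dictionary))
# ['opts', 'post', 'pots', 'spot', 'stop', 'tops']
```

The number of orderings grows factorially with the word length (an 11-letter word has up to 39,916,800), so this is only workable for short single words.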

sanjiv
0

Off the top of my head, the solution that makes the most sense would be to pick a letter out of the input string randomly and filter the dictionary based on words that start with that. Then pick another, filter on the second letter, etc. In addition, filter out words that can't be made with the remaining text. Then when you hit the end of a word, insert a space and start it over with the remaining letters. You might also restrict words based on word type (e.g. you wouldn't have two verbs next to each other, you wouldn't have two articles next to each other, etc).

Serafina Brocious
0
  1. As Jason suggested, prepare a dictionary as a hashtable whose key is the word sorted alphabetically and whose value is the word itself (you may have multiple values per key).
  2. Remove whitespace and sort your query before looking it up.

After this, you'd need to do some sort of a recursive, exhaustive search. Pseudo code is very roughly:

function FindWords(solutionList, wordsSoFar, sortedQuery)
  // base case
  if sortedQuery is empty
     solutionList.Add(wordsSoFar)
     return

  // recursive case

  // InitialStrings("abc") is {"a","ab","abc"}
  foreach initialStr in InitialStrings(sortedQuery)
    // Remaining letters after initialStr
    sortedQueryRec := sortedQuery.Substring(initialStr.Length)
    words := words matching initialStr in the dictionary
    // Note that sometimes words list will be empty
    foreach word in words
      // Append should return a new list, not change wordsSoFar
      wordsSoFarRec := Append(wordsSoFar, word)
      FindWords(solutionList, wordsSoFarRec, sortedQueryRec)

In the end, you need to iterate through the solutionList, and print the words in each sublist with spaces between them. You might need to print all orderings for these cases (e.g. "I am Sam" and "Sam I am" are both solutions).

Of course, I didn't test this, and it's a brute force approach.

Mitch Wheat
dbkk