1

I am trying to solve the age old anagram problem. Thanks to many tutorials out there, I am able to iterate through a Set of Strings, recursively find all permutations, then compare them against a list of English words. The problem I am finding is that after about three words (usually on something like "anamorphosis"), I get a OutOfMemory error. I tried breaking my batches into small sets because it appears to be the recursive part consuming all my memory. But even just "anamorphosis" locks it up...

Here I read the words from a file into a List

Scanner scanner = new Scanner(resource.getInputStream());
   while (scanner.hasNext()) {
       String s = scanner.nextLine();
        uniqueWords.add(s.toLowerCase());
   }

Now I break them into smaller sets and call a class to generate anagrams:

List<List<String>> subSets = Lists.partition(new ArrayList(uniqueWords), SET_SIZE);

for (List<String> set: subSets) {
      // tried created as class attribute & injection, no difference 
      AnagramGenerator anagramGenerator = new AnagramGenerator();
      List<Word> anagrams = anagramGenerator.createWordList(set);
      wordsRepository.save(anagrams);
      LOGGER.info("Inserted {} records into the database", anagrams.size());
 }

And finally my generator:

public class AnagramGenerator {

private Map<String, List<String>> map = new Hashtable<>();
public List<Word> createWordList(List<String> dictionary) {

   buildAnagrams(dictionary);

   List<Word> words = new ArrayList<>();
   for (Map.Entry<String, List<String>> entry : map.entrySet()) {
       words.add(new Word(entry.getKey(), entry.getValue()));
   }
    return words;
   }

private Map<String, List<String>> buildAnagrams(List<String> dictionary) {

        for (String str : dictionary) {
            String key = sortString(str);
            if (map.get(key) != null) {
                map.get(key).add(str.toLowerCase());
            } else {
                if (str.length() < 2) {
                    map.put(key, new ArrayList<>());
                } else {
                    Set<String> permutations = permutations(str);
                    Set<String> anagramList = new HashSet<>();

                    for (String temp : permutations) {
                        if (dictionary.contains(temp) && !temp.equalsIgnoreCase(str)) {
                            anagramList.add(temp);
                        }
                    }
                    map.put(key, new ArrayList<>(anagramList));
                }
            }
        }
        return map;
    }

   private Set<String> permutations(String str) {    
        if (str.isEmpty()) {
            return Collections.singleton(str);
        } else {
            Set<String> set = new HashSet<>();
            for (int i = 0; i < str.length(); i++)
                for (String s : permutations(str.substring(0, i) + str.substring(i + 1)))
                    set.add(str.charAt(i) + s);
            return set;
        }
    }

Edit: Based upon the excellent feedback I have changed my generator from a permutations to a work lookup:

public class AnagramGenerator {
private Map<String, Set<String>> groupedByAnagram = new HashMap<String, Set<String>>();

    private Set<String> dictionary;

    public AnagramGenerator(Set<String> dictionary) {

        this.dictionary = dictionary;
    }

 public List<Word> searchAlphabetically() {

        List<Word> words = new ArrayList<>();
        for (String word : dictionary) {
            String key = sortString(word);
            if (!groupedByAnagram.containsKey(key)) {
                groupedByAnagram.put(key, new HashSet<>());
            }
            if (!word.equalsIgnoreCase(key)) {
                groupedByAnagram.get(key).add(word);
            }
        }

        for (Map.Entry<String, Set<String>> entry : groupedByAnagram.entrySet()) {
            words.add(new Word(entry.getKey(), new ArrayList(entry.getValue())));
        }

        return words;
    }
 private String sortString(String goodString) {

        char[] letters = goodString.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

It has a bit more tweaking so I don't add a word as it's own anagram, but otherwise this appears to be blazing fast. And, the code is much cleaner. Thanks everyone!

sonoerin
  • 5,015
  • 23
  • 75
  • 132
  • Where are you getting the error? Stack trace? – Steve Smith Feb 28 '17 at 16:35
  • You're creating a hell of a lot of Sets there.. – SpaceCowboy Feb 28 '17 at 16:37
  • 1
    using recursion to find permutations requires a lot of overhead and typically involves increasing the allotted heap space for your program. I recommend using another way to create all the permutations. – RAZ_Muh_Taz Feb 28 '17 at 16:38
  • I agree that I am creating lots of Sets with my recursion, but so far that has been the only way to do the string manipulation. Any alternative ideas ? Could it be as simple as just a char[] swapper? – sonoerin Feb 28 '17 at 16:43
  • You are unlikely to have enough memory. You need to process the permutations progressively as you generate them. – Peter Lawrey Feb 28 '17 at 16:58
  • 1
    note if you have duplicate letters, you will get less combinations. e.g. `add` has three anagrams but `the` has 6. – Peter Lawrey Feb 28 '17 at 16:59

3 Answers3

5

As noted for longer words, the number of permutations soon becomes enormous.

/usr/share/dict/british-english on Debian has 99,156 lines. There are longer word lists, but let's use that as an example.

The number of permutations for a nine letter word is 9! = 362,880

Therefore, for words of 9 letters or more, it's less computational effort to try every word in the dictionary, than it is to try every permutation of the input word.

10! milliseconds = ~1 hour
12! milliseconds = ~5.54 days
15! milliseconds = ~41.44 years

And you'd be lucky to process one permutation per ms, so you can see you soon get to a number of permutations that are completely impractical to work with. The impact on stack and heap mounts up at the same rate.

So, try the algorithm (pseudocode):

 sorted_input = sort_alphabetically(input_word)
 for each dictionary_word // probably a file readline()
     sorted_dictionary_word = sort_alphabetically(dictionary_word)
     if(sorted_dictionary_word = sorted_input)
         it's an anagram! Handle it
     end 
 end

Similarly, you can fairly quickly write all dictionary-word algorithms into a lookup data structure. Pseudocode again; in Java you could use a Map<String, List<String>> or a MultiMap from Apache Commons or Guava:

  multimap = new MultiMap<String, String> // or whatever

  def build_dict:
      for each dictionary_word // probably a file readline()
          multimap.add(
               sort_alphabetically(dictionary_word), 
               dictionary_word)
      end
  end

  def lookup_anagrams(word):
      return multimap.get(sort_alphabetically(word))
  end 

This takes up a moderate amount of memory (the whole dictionary, plus a bit for the keys and map overheads), but means that once the structure is created, you can query over and over again very cheaply indeed.

If you want to find two-word anagrams, you will need a more complicated and interesting algorithm. But even then, avoiding brute-forcing the entire search-space of permutations is vital to your success.

slim
  • 40,215
  • 13
  • 94
  • 127
2

Doing a quick calculation: "anamorphosis" has 12 letters, which gives 12! = 479,001,600 permutations. Each string takes at least 12 bytes (assuming, say, UTF-8 with ASCII characters only), meaning a total size of 12 * 479,001,600 bytes, which is roughly 6 GB.

Now, the default heap size is, as far as I know, set to 1GB or (if smaller) one quarter of available memory. This is less than the required 6GB.

There are two ways out of this:

  • increase heap size when executing the program, but it will not work for longer words as permutations grow exponentially: with just one more letter, "accomplishing" already requires 78GB.

  • streaming through permutations rather than materializing them into a set of strings. Concretely, this means still using a recursion, but instead of storing each recursively generated permutation, it is processed immediately, then forgotten when moving on to the next one.

Now, if it needs to be done against the entire dictionary, another approach if you have access to a cluster, could be to compute the cartesian product of the dictionary with itself, store it on a distributed file system like HDFS (should be in the order of magnitude of a billion entries), then go through all pairs in parallel using MapReduce and output the pairs that are anagrams from one another. It's more efforts, but the complexity goes down from exponential in the length of the words, to quadratic in the size of the dictionary.

Ghislain Fourny
  • 6,971
  • 1
  • 30
  • 37
  • Note: most Strings with 12 characters will use ~64 bytes of memory. – Peter Lawrey Feb 28 '17 at 16:57
  • Yes, you are right Peter, there is additional overhead. I was optimistic with my lower bound as it was sufficient to make the point. It definitely brings materialization of 12-letter anagrams out of reach of a commodity computer: http://stackoverflow.com/questions/31206851/how-much-memory-does-a-string-use-in-java-8 – Ghislain Fourny Feb 28 '17 at 17:01
  • I have an old computer under my stairs which has 128 GB ;) I am looking forward to upgrading it. – Peter Lawrey Feb 28 '17 at 17:02
  • I do not have access to a cluster, this is strictly run on my laptop. I am looking at using parallel streams to reduce the Set creation quantity. But I am concerned about the amount of memory this is going to take. – sonoerin Feb 28 '17 at 17:07
  • @PeterLawrey Wow, I didn't realize HFT required that much RAM :-) I hope that with the upgrade you will manage to get 13-letter words to work! – Ghislain Fourny Feb 28 '17 at 17:08
  • @sonoerin the cluster is not necessarily needed actually as the cartesian product can also be streamed through. If time is not a concern, maybe a simple double for loop on the dictionary will do the trick, with an isPermutation(String, String) function? If the dictionary is partitioned by length, it can also help speeding up. – Ghislain Fourny Feb 28 '17 at 17:11
  • With the parallel streams I am seeing times: scalpellum = 2 seconds, undermining=22 seconds, and anamorphosis just never finishes. This is a crash course in streams for sure! – sonoerin Feb 28 '17 at 17:16
  • I do not actually mean to run the anagram generation in parallel, as this is too time consuming. I mean to iterate over the dictionary twice in nested loops. If you use @slim's trick of sorting letters, it can be even faster. With a sorted set of sorted strings, it can even be made linear in the size of the dictionary as one for loop will be enough. – Ghislain Fourny Feb 28 '17 at 17:20
  • @GhislainFourny it doesn't usually but if you are going to spend that much money on fast CPU etc you may as well get 256 GB or 512 GB. One client has 3 TB servers. All running Java ;) – Peter Lawrey Feb 28 '17 at 17:20
  • @PeterLawrey ... and I can only dream of the network bandwidth and network latency you must have. :-) – Ghislain Fourny Feb 28 '17 at 17:28
  • @GhislainFourny my latencies at home are 30 micros but hard to justify better. One of my clients gets ~3 micros round trip java to java :P – Peter Lawrey Feb 28 '17 at 17:37
1

Here is an answer that combines slim's approach with mine, "Pseudo-Java code":

Map<String, Set<String>> groupedByAnagram = new HashMap<String, Set<String>>();

for(String word: dictionary)
{
  String footprint = sort_alphabetically(word);
  if(!groupedByAnagram.contains(footprint))
  {
    groupedByAnagram.put(footprint, new HashSet<String>>());
  }
  groupedByAnagram.get(footprint).insert(word); 
}

for(Set<String> anagram: groupedByAnagram.values())
{
  if(anagram.size() > 1)
  {
    System.out.println("Anagram found.");
    for (String word: anagram)
    {
      System.out.println(word);
    }
  } 
}

It first builds an index of all words by "anagram fingerprint" (slim's idea) and then goes through it, only outputting entries with more than one word.

Ghislain Fourny
  • 6,971
  • 1
  • 30
  • 37
  • Wasn't sure who to give the answer to. Slim gave the great idea and Ghislain gave a great implementation. I hope this is the right way to vote. – sonoerin Feb 28 '17 at 20:32
  • Thank you sonoerin, I am glad that it worked. If you can still change, please do not hesitate to give slim the credits for his answer though, as I only wanted to contribute a helpful wrap up. I'll be totally fine and would even prefer for him to get the reputation points, it would just feel "right" to me. :-) – Ghislain Fourny Mar 01 '17 at 07:36
  • I went ahead and gave it to Slim, but it seems like you both solved the problem. I do have a followup question - the above solution will add the original word as an anagram - do you know how i can prevent that? – sonoerin Mar 07 '17 at 21:57
  • 1
    Thanks! The above suggested solution has no original word. What it does is group all the words in the dictionary in bags, each bag containing words that are anagrams of each other. It then prints the bags one by one, but the order inside a bag is arbitrary. If what you need is make a request with a specified word, you can take its anagram fingerprint, look up the corresponding bag in `groupedByAnagram`, and output all words of that bag except the original word. I hope it helps. – Ghislain Fourny Mar 08 '17 at 08:14