2

I want to find anagrams in .txt file using Java Stream. Here what I have:

try (InputStream is = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt").openConnection().getInputStream();
     BufferedReader reader = new BufferedReader(new InputStreamReader(is));
     Stream<String> stream = reader.lines()) {

And the method for anagrams:

public boolean isAnagram(String firstWord, String secondWord) {
    char[] word1 = firstWord.replaceAll("[\\s]", "").toCharArray();
    char[] word2 = secondWord.replaceAll("[\\s]", "").toCharArray();
    Arrays.sort(word1);
    Arrays.sort(word2);
    return Arrays.equals(word1, word2);
}

How to check if a word in unixdict.txt is anagram using Java 8 Stream? Is there any way to compare one word to all words in the stream?

Naman
  • 27,789
  • 26
  • 218
  • 353
dzierzak
  • 79
  • 10

3 Answers3

6

When you want to find all anagrams, it’s not recommended to try to compare one word with all other words, as you’ll end up comparing every word with every other word, which is known as quadratic time complexity. For processing 1,000 words, you would need one millions comparisons, for processing 100,000 words, you would need 10,000,000,000 comparisons and so on.

You may change your isAnagram method to provide a lookup key for data structures like HashMap:

static CharBuffer getAnagramKey(String s) {
    char[] word1 = s.replaceAll("[\\s]", "").toCharArray();
    Arrays.sort(word1);
    return CharBuffer.wrap(word1);
}

The class CharBuffer wraps a char[] array and provides the necessary equals and hashCode methods without copying the array contents, which makes it preferable to constructing a new String.

As a side note, .replaceAll("[\\s]", "") could be simplified to .replaceAll("\\s", ""), both would eliminate all space characters, but the example input of your question has no space characters at all. To remove all non-word characters like apostrophes and ampersands, you could use s.replaceAll("\\W", "").

Then, you may process all words to find anagrams in a single linear pass like

URL srcURL = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt");
try(InputStream is = srcURL.openStream();
    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
    Stream<String> stream = reader.lines()) {

    stream.collect(Collectors.groupingBy(s -> getAnagramKey(s)))
        .values().stream()
        .filter(l -> l.size() > 1)
        .forEach(System.out::println);
}

With this solution, the printing likely becomes the more expensive part for larger word lists. So you might change the stream’s operation, e.g. the following prints the top ten of anagram combinations:

stream.collect(Collectors.groupingBy(s -> getAnagramKey(s)))
    .values().stream()
    .filter(l -> l.size() > 1)
    .sorted(Collections.reverseOrder(Comparator.comparingInt(List::size)))
    .limit(10)
    .forEach(System.out::println);
Holger
  • 285,553
  • 42
  • 434
  • 765
  • Thank you very much! I have one more question; Do you know how to limit this stream to just the longest array (a word with the biggest amount of anagrams)? I know how to this on lists, but don't know how to this with Java Stream. – dzierzak Dec 02 '19 at 14:01
  • Well, in the last example you could just change `limit(10)` to `limit(1)`, but to get the max element, there’s also the simpler `.max(Comparator.comparingInt(List::size))`, to be used instead of `.sorted(…) .limit(…)` – Holger Dec 02 '19 at 14:09
  • Okay, but if we have few words with the same amount of anagrams, then we always see only the first one :/ – dzierzak Dec 02 '19 at 14:15
  • See [How to force max() to return ALL maximum values in a Java Stream?](https://stackoverflow.com/q/29334404/2711488). Note that the result of the `collect` call in my solution returns a `Map` that can be stored in a variable, so the simple “find all max” iterating over it twice is possible. Of course, the more complicated single-pass solution works too. – Holger Dec 02 '19 at 14:20
  • 1
    Clearly this is more efficient if you want to find all the words that are anagrams. The OP implied (at least to me) they just wanted to find if a particular word was an anagram so I presumed it was a one shot deal. Best thing about your solution, for me, was the `CharBuffer`. I need to remember that! – WJS Dec 02 '19 at 15:01
  • 1
    @WJS I agree that there was a bit of guesswork regarding the final goal. Your solution is not bad for the task of checking for a single word. Re `CharBuffer`, I always hope that my answers for such Q&As help pointing at some useful tools for other, real life tasks. For more elaborated use cases, look at the second half of [this answer](https://stackoverflow.com/a/48172590/2711488)… – Holger Dec 02 '19 at 15:07
1

This works. I first did all the sorts in the stream but this is much more efficient.

      InputStream is = new URL("http://wiki.puzzlers.org/pub/wordlists/unixdict.txt")
              .openConnection().getInputStream();
      BufferedReader reader = new BufferedReader(new InputStreamReader(is));

      String word = "germany";
      final String sword = sortedWord(word);
      reader.lines().filter(w -> sortedWord(w).compareTo(sword) == 0).forEach(
            System.out::println);

      static String sortedWord(String w) {
         char[] chs = w.toCharArray();
         Arrays.sort(chs);
         return String.valueOf(chs);
      }

A possible improvement would be to filter the lengths of the words first. And you might want to try this word list as it has more words in it.

WJS
  • 36,363
  • 4
  • 24
  • 39
  • Instead of `sortedWord(w).compareTo(sword) == 0`, you may consider using `sortedWord(w).equals(sword)`… – Holger Dec 02 '19 at 13:06
0

I think your best option might be to use the multimap collector to convert the stream into a Guava multimap using the sorted version of the string as the key to the map. See Cleanest way to create a guava MultiMap from a java8 stream for an example of how to do this. If you only want the resulting sets of anagrams, you could then use multimap.asMap().entrySet().stream()... to filter and collect the results per your needs.

Don Hosek
  • 981
  • 6
  • 23