8

For my assignment I have to replace for loops with streams that count the frequency of words in a text document, and I am having trouble figuring the TODO part out.

String filename = "SophieSallyJack.txt";
if (args.length == 1) {
    filename = args[0];
}
Map<String, Integer> wordFrequency = new TreeMap<>();

List<String> incoming = Utilities.readAFile(filename);

wordFrequency = incoming.stream()
    .map(String::toLowerCase)
    .filter(word -> !word.trim().isEmpty())
    .collect(Collectors.toMap(word -> word, word -> 1, (a, b) -> a + b, TreeMap::new));                

int maxCnt = 0;

// TODO add a single statement that uses streams to determine maxCnt
for (String word : incoming) {
    Integer cnt = wordFrequency.get(word);
    if (cnt != null) {
        if (cnt > maxCnt) {
            maxCnt = cnt;
        }
    }
}
System.out.print("Words that appear " + maxCnt + " times:");

I have tried this:

wordFrequency = incoming.parallelStream().
    collect(Collectors.toConcurrentMap(w -> w, w -> 1, Integer::sum));

But that is not right and I'm not sure how to incorporate maxCnt into the stream.

MC Emperor
  • 22,334
  • 15
  • 80
  • 130
  • `wordFrequency.entrySet().stream().max(Map.Entry.comparingByValue()).get().getKey()` – Tim Biegeleisen Oct 16 '18 at 06:39
  • With your stream I do not get an error, but I am still unsure how to incorporate maxCnt into the stream so it returns the frequency of the words that appear the most often. –  Oct 16 '18 at 06:57
  • Possible duplicate of: [Using Java8 Stream to find the highest values from map](https://stackoverflow.com/questions/42060294/using-java8-stream-to-find-the-highest-values-from-map) – Tim Biegeleisen Oct 16 '18 at 07:08
  • @C.Smith What is the format of your file? How does each line in your file look like? Is it a sentence or just a word? – Ravindra Ranwala Oct 16 '18 at 07:16
  • @RavindraRanwalla The file I am working with is "Sophie Sally and Jack were dachshunds Dachshunds are the best dogs and all dogs are better than cats" And the correct output for the program is "Words that appear 2 times: and are dachshunds dogs". The '2' is maxCnt, it counts the frequency of the most used words in the file, and I am just trying to figure out how to implement maxCnt into the stream without receiving an error –  Oct 16 '18 at 14:58

4 Answers4

2

Assuming you have all the words extracted from a file in a List<String> this word count for each word can be computed using this approach,

Map<String, Long> wordToCountMap = words.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

The most freequent word can then be computed using the above map like so,

Entry<String, Long> mostFreequentWord = wordToCountMap.entrySet().stream()
    .max(Map.Entry.comparingByValue())
    .orElse(new AbstractMap.SimpleEntry<>("Invalid", 0l));

You may change the above two pipelines together if you wish like this,

Entry<String, Long> mostFreequentWord = words.stream()
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
    .entrySet().stream()
    .max(Map.Entry.comparingByValue())
    .orElse(new AbstractMap.SimpleEntry<>("Invalid", 0l));

Update

As per the following discussion it is always good to return an Optional from your computation like so,

Optional<Entry<String, Long>> mostFreequentWord = words.stream()
    .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
    .entrySet().stream()
    .max(Map.Entry.comparingByValue());
Ravindra Ranwala
  • 20,744
  • 6
  • 45
  • 63
  • 1
    Rather than second guess requirements, you’re better off throwing an exception: `.orElseThrow(IllegalArgumentException::new)`. The calling code of such an operation may not know that 0 is a special value. – Bohemian Oct 16 '18 at 07:44
  • 1
    @Bohemian well consider that this code counts words, zero has a pretty obvious meaning – Eugene Oct 16 '18 at 08:29
  • @Bohemian I thought that returning a default value would probably be better than throwing an exception? What is your opinion on that? – Ravindra Ranwala Oct 16 '18 at 08:46
  • 1
    @RavindraRanwala the best way it to return an `Optional` of either the count or a map entry. Returning Optional makes it obvious to the caller that there may not be a result (in this case if the file is empty). IMHO returning 0 is not good because “no result” must be checked by testing if the returned value is 0, which means the value 0 is coded on both sides of the API, making it an “agreed” value thus technically letting the implementation leak out. There’s also the choice of the string for the map entry, “Invalid”, blank or null (not allowed). Optional neatly avoids the whole problem. – Bohemian Oct 16 '18 at 13:43
  • @Bohemian That makes sense. Let's return an `Optional`. Thanks for the explanation! – Ravindra Ranwala Oct 16 '18 at 13:49
1

Ok, first of all, your wordFrequency line can make use of Collectors#groupingBy and Collectors#counting instead of writing your own accumulator:

    List<String> incoming = Arrays.asList("monkey", "dog", "MONKEY", "DOG", "giraffe", "giraffe", "giraffe", "Monkey");
    wordFrequency = incoming.stream()
            .filter(word -> !word.trim().isEmpty()) // filter first, so we don't lowercase empty strings
            .map(String::toLowerCase)
            .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

Now that we got that out of the way... Your TODO line says use streams to determine maxCnt. You can do that easily by using max with naturalOrder:

    int maxCnt = wordFrequency.values()
            .stream()
            .max(Comparator.naturalOrder())
            .orElse(0L)
            .intValue();

However, your comments make me think that what you actually want is a one-liner to print the most frequent words (all of them), i.e. the words that have maxCnt as value in wordFrequency. So what we need is to "reverse" the map, grouping the words by count, and then pick the entry with highest count:

    wordFrequency.entrySet().stream() // {monkey=3, dog=2, giraffe=3}
            .collect(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toList()))).entrySet().stream() // reverse map: {3=[monkey, giraffe], 2=[dog]}
            .max(Comparator.comparingLong(Map.Entry::getKey)) // maxCnt and all words with it: 3=[monkey, giraffe]
            .ifPresent(e -> {
                System.out.println("Words that appear " + e.getKey() + " times: " + e.getValue());
            });

This solution prints all the words with maxCnt, instead of just one:

Words that appear 3 times: [monkey, giraffe].

Of course, you can concatenate the statements to get one big do-it-all statement, like this:

    incoming.stream() // [monkey, dog, MONKEY, DOG, giraffe, giraffe, giraffe, Monkey]
            .filter(word -> !word.trim().isEmpty()) // filter first, so we don't lowercase empty strings
            .map(String::toLowerCase)
            .collect(groupingBy(s -> s, counting())).entrySet().stream() // {monkey=3, dog=2, giraffe=3}
            .collect(groupingBy(Map.Entry::getValue, mapping(Map.Entry::getKey, toList()))).entrySet().stream() // reverse map: {3=[monkey, giraffe], 2=[dog]}
            .max(Comparator.comparingLong(Map.Entry::getKey)) // maxCnt and all words with it: 3=[monkey, giraffe]
            .ifPresent(e -> {
                System.out.println("Words that appear " + e.getKey() + " times: " + e.getValue());
            });

But now we're stretching the meaning of "one statement" :)

walen
  • 7,103
  • 2
  • 37
  • 58
  • I just need maxCnt to add up the frequency of the most used words. The file is "Sophie Sally and Jack were dachshunds Dachshunds are the best dogs and all dogs are better than cats" And the output is "Words that appear 2 times: and are dachshunds dogs" I just need to figure out how to implement '2' into the stream, another part of the program takes care of the 'and are dachshunds dogs' part. maxCnt needs to be an int and when I use your code with naturalOrder, I get an error with orElse(0L) –  Oct 16 '18 at 14:53
  • @C.Smith If `maxCnt` having to be `int` is your only problem, that's great! Because you only have to add `.intValue()` at the end of the statement ;) I'll update my answer. – walen Oct 17 '18 at 08:16
1

Well, you have done almost everything you needed with that TreeMap, but it seems you don't know that it has a method called lastEntry and that is the only one you need to call after you computed wordFrequency to get the word with the highest frequency.

The only problem is that this is not very optimal, since TreeMap sorts the data on each insert and you don't need sorted data, you need the max. Sorting in case of TreeMap is O(nlogn), while inserting into a HashMap is O(n).

So instead of using that TreeMap, all you need to change is to a HashMap:

wordFrequency = incoming.stream()
    .map(String::toLowerCase)
    .filter(word -> !word.trim().isEmpty())
    .collect(Collectors.toMap(
             Function.identity(), 
             word -> 1, 
             (a, b) -> a + b, 
             HashMap::new)); 

Once you have this Map, you need to find max - this operation is O(n) in general and could be achieved with a stream or without one:

 Collections.max(wordFrequency.entrySet(), Map.Entry.comparingByValue())

This approach with give you O(n) for HashMap insert, and O(n) for finding the max - thus O(n) in general, so it's faster than TreeMap

Eugene
  • 117,005
  • 15
  • 201
  • 306
0

By piecing together information I was able to successfully replace the for loop with

    int maxCnt = wordFrequency.values().stream().max(Comparator.naturalOrder()).get();
    System.out.print("Words that appear " + maxCnt + " times:");

I appreciate all the help.

  • It is better to avoid using `.get()` without previously checking for presence - it will throw a `NoSuchElementException` if the input is empty. I prefer to always explicitly throw an exception with a meaningful message `.orElseThrow(() -> MyException("Could not find maximum - input empty?");` – Hulk Oct 17 '18 at 09:04