3

I am working on a project to write a program that finds the 10 most used words in a text, but I got stuck and don't know what I should do next. Can someone help me please?

This is as far as I have got:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Lab4 {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner file = new Scanner(new File("text.txt")).useDelimiter("[^a-zA-Z]+");
        List<String> words = new ArrayList<String>();
        while (file.hasNext()){
            String tx = file.next();
            // String x = file.next().toLowerCase();
            words.add(tx);
        }
        Collections.sort(words);
        // System.out.println(words);
    }
}
nickb
Ingen Alls
  • 6
    A `List` of words is not sufficient, you also need a `count` of each occurrence of the words. What data structures would you use for such a task? (Clearly, this is homework, which is why I am posing this question) – nickb Dec 20 '12 at 19:46
  • I think you have a bug with how you're reading the file. file.next() will eventually be null, so you should check for that. – nolegs Dec 20 '12 at 19:49

5 Answers

10

You can use a Guava Multiset, here is a word-counting example: http://code.google.com/p/guava-libraries/wiki/NewCollectionTypesExplained

And here is how to find the words with the highest count in a Multiset: Simplest way to iterate through a Multiset in the order of element frequency?

UPDATE: I wrote this answer in 2012. Java 8 has been released since then, and it is now possible to find the 10 most used words in a few lines without external libraries:

List<String> words = ...

// map the words to their count
Map<String, Integer> frequencyMap = words.stream()
         .collect(toMap(
                s -> s, // key is the word
                s -> 1, // value is 1
                Integer::sum)); // merge function counts the identical words

// find the top 10
List<String> top10 = words.stream()
        .sorted(comparing(frequencyMap::get).reversed()) // sort by descending frequency
        .distinct() // take only unique values
        .limit(10)   // take only the first 10
        .collect(toList()); // put it in a returned list

System.out.println("top10 = " + top10);

The static imports are:

import static java.util.Comparator.comparing;
import static java.util.stream.Collectors.toList;
import static java.util.stream.Collectors.toMap;
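
For reference, the snippet and its static imports can be assembled into one self-contained class. This is only a sketch: the class name `TopWordsStream`, the method `topWords`, and the sample sentence are made up for illustration. Lower-casing each word before counting is an assumption added here so that "The" and "the" count as the same word.

```java
import static java.util.Comparator.comparing;
import static java.util.stream.Collectors.toList;
import static java.util.stream.Collectors.toMap;

import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical wrapper class, not part of the original answer.
class TopWordsStream {

    // Returns up to `limit` distinct words, most frequent first.
    static List<String> topWords(List<String> words, int limit) {
        // Assumption: normalise case so "The" and "the" are counted together.
        List<String> normalized = words.stream()
                .map(String::toLowerCase)
                .collect(toList());

        // Map each word to its count; Integer::sum merges duplicate keys.
        Map<String, Integer> frequencyMap = normalized.stream()
                .collect(toMap(s -> s, s -> 1, Integer::sum));

        return normalized.stream()
                .sorted(comparing(frequencyMap::get).reversed()) // descending frequency
                .distinct()                                      // one entry per word
                .limit(limit)
                .collect(toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the cat and The dog and the bird".split(" "));
        System.out.println(topWords(words, 2)); // prints [the, and]
    }
}
```

Sorting the full word list is simple but does more work than sorting the map entries, so for large inputs sorting `frequencyMap.entrySet()` may be preferable.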
lbalazscs
  • Downvoting because using a library for such a simple task ONLY is way too much of an overkill. – Machinarius Dec 20 '12 at 22:06
  • 2
    Who said that the OP should use Guava "only" for this task? For good Java programmers Guava is like standard collections. You just have to know it. Multimap will hopefully be added to Java 8. – lbalazscs Dec 20 '12 at 22:08
  • Sorry sir, i am not a java developer (hate it, in fact) so i had no idea Guava is such a thing. Point is, the OP's wording and specific question lead me to believe he might just be starting, introducing 3rd party dependencies at that stage is a bad idea. – Machinarius Dec 20 '12 at 22:10
  • 1
    You are supposed to downvote if an answer is "not useful", and not if you think that the answer is "too advanced". Stackoverflow is also for future reference, you do not know who will find this solution elegant and useful in the future... – lbalazscs Dec 20 '12 at 22:42
  • While your comment is completely valid and made me shift my view a bit, i still have to argue that my downvote is valid as other answers were much more apt to the question at hand. I downvoted because this answer, even if valid, should not be seen as the most optimal approach by the OP, at least, again, for what i presume his skill level is. – Machinarius Dec 21 '12 at 23:21
  • "Use your downvotes whenever you encounter an egregiously sloppy, no-effort-expended post, or an answer that is clearly and perhaps dangerously incorrect." - certainly not the case here. http://stackoverflow.com/privileges/vote-down – lbalazscs Jan 03 '13 at 21:25
  • @lbalazscs - Nice job. How can I make the merge and count logic case-insensitive? – Chesser Feb 24 '17 at 00:39
4

Create a map to keep track of occurrences like so:

   Scanner file = new Scanner(new File("text.txt")).useDelimiter("[^a-zA-Z]+");
   HashMap<String, Integer> map = new HashMap<>();

   while (file.hasNext()){
        String word = file.next().toLowerCase();
        if (map.containsKey(word)) {
            map.put(word, map.get(word) + 1);
        } else {
            map.put(word, 1); // the first occurrence counts as 1, not 0
        }
    }

    ArrayList<Map.Entry<String, Integer>> entries = new ArrayList<>(map.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {

        @Override
        public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
            return a.getValue().compareTo(b.getValue());
        }
    });

    // Stop early if the text has fewer than 10 distinct words
    for(int i = 0; i < Math.min(10, entries.size()); i++){
        System.out.println(entries.get(entries.size() - i - 1).getKey());
    }
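
As a variation on the loop above, `Map.merge` (available since Java 8) can replace the `containsKey` branch, and sorting in descending order lets the final loop read from the front of the list. The sketch below reads from a string instead of `text.txt` so that it runs on its own; the class and method names are illustrative only and not from the original answer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

// Hypothetical class name, used only for this sketch.
class WordCountMerge {

    // Counts lower-cased words, treating any run of non-letters as a separator.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("[^a-zA-Z]+")) {
            while (scanner.hasNext()) {
                // Inserts 1 for a new word, otherwise adds 1 to the stored count.
                counts.merge(scanner.next().toLowerCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to `limit` words, most frequent first.
    static List<String> mostUsed(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("b a, a b; a c");
        System.out.println(mostUsed(counts, 10)); // prints [a, b, c]
    }
}
```

Words with equal counts come out in no particular order, because `HashMap` does not define an iteration order.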
rtheunissen
1

Here is an even shorter version than the one from lbalazscs; it also uses Java 8's Stream API:

Arrays.stream(new String(Files.readAllBytes(PATH_TO_FILE), StandardCharsets.UTF_8).split("\\W+"))
            .collect(Collectors.groupingBy(Function.<String>identity(), HashMap::new, Collectors.counting()))
            .entrySet()
            .stream()
            .sorted(((o1, o2) -> o2.getValue().compareTo(o1.getValue())))
            .limit(10)
            .forEach(System.out::println);

This does everything in one go: it loads the file, splits the text on non-word characters, groups the resulting words, assigns each group its word count, and then prints the top ten words together with their counts.

For an in-depth discussion of a very similar setup, see also: https://stackoverflow.com/a/33946927/327301
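
A sketch of the same pipeline applied to an in-memory string, so that it can be run without a file. Three things here are assumptions that go beyond the original one-liner: empty tokens are filtered out (`split` can produce one when the text starts with a non-word character), words are lower-cased, and ties are broken alphabetically to make the output deterministic. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical class name, used only for this sketch.
class TopWordsGrouping {

    // Returns up to `limit` (word, count) entries, highest count first.
    static List<Map.Entry<String, Long>> topWords(String text, int limit) {
        return Arrays.stream(text.split("\\W+"))
                .filter(word -> !word.isEmpty())   // drop a possible leading empty token
                .map(String::toLowerCase)          // assumption: case-insensitive counting
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet()
                .stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        topWords("The cat and the dog and the bird.", 2)
                .forEach(System.out::println); // prints the=3, then and=2
    }
}
```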

yankee
0
package src;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.Map.Entry;

public class ScannerTest
{
    public static void main(String[] args) throws FileNotFoundException
        {
        Scanner scanner = new Scanner(new File("G:/Script_nt.txt")).useDelimiter("[^a-zA-Z]+");
        Map<String, Integer> map = new HashMap<String, Integer>();
        while (scanner.hasNext())
            {
            String word = scanner.next();
            if (map.containsKey(word))
                {
                map.put(word, map.get(word)+1);
                }
            else
                {
                map.put(word, 1);
                }
            }

        List<Map.Entry<String, Integer>> entries = new ArrayList<Entry<String,Integer>>( map.entrySet());

        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {

            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return a.getValue().compareTo(b.getValue());
            }
        });

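        // Print only the 10 most used words (or fewer if the text has fewer distinct words)
        for(int i = 0; i < Math.min(10, entries.size()); i++){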
            System.out.println(entries.get(entries.size() - i - 1).getKey()+" "+entries.get(entries.size() - i - 1).getValue());
        }
        }
}
-1

Read the input into a string, from a file or from the command line, and pass it to the method below. It returns a map whose keys are the words and whose values are the number of times each word occurs in that sentence or paragraph.

public Map<String,Integer> getWordsWithCount(String sentances)
{
    Map<String,Integer> wordsWithCount = new HashMap<String, Integer>();

    String[] words = sentances.split(" ");
    for (String word : words)
    {
        if(wordsWithCount.containsKey(word))
        {
            wordsWithCount.put(word, wordsWithCount.get(word)+1);
        }
        else
        {
            wordsWithCount.put(word, 1);
        }

    }

    return wordsWithCount;

}
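
The method above only produces the counts. One possible way to get from such a map to the ten most used words is a bounded min-heap, which keeps at most `k` entries in memory instead of sorting the whole map. This is a sketch under that assumption; `TopKFromCounts` and `topK` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of the original answer.
class TopKFromCounts {

    // Returns up to k words from the map, most frequent first.
    static List<String> topK(Map<String, Integer> counts, int k) {
        // Min-heap ordered by count: the least frequent kept word sits on top.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.<String, Integer>comparingByValue());
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            heap.offer(entry);
            if (heap.size() > k) {
                heap.poll(); // evict the current least frequent word
            }
        }
        // Polling yields ascending counts, so reverse at the end.
        List<String> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll().getKey());
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 3);
        counts.put("b", 2);
        counts.put("c", 1);
        System.out.println(topK(counts, 2)); // prints [a, b]
    }
}
```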
Rais Alam