1

I am writing a program that compares multiple documents based on the words they have in common. I am able to tokenize all of the words and store all of them in an ArrayList since it allows duplicates. However, I am not sure that that is the best way of doing it. I need to find the top 50 most frequent words in the ArrayList, and I am not really sure how to do that. Is there a better data structure for this operation?

user1765804
  • 161
  • 2
  • 3
  • 10

1 Answers1

2

If you just want to compare occurrences, you can use a map such as a HashMap, TreeMap, or any other implementation.

The key will be the word (String), the value will be the number of occurrences (Integer). You'll go over your document, and lookup each word in the map. If it exists, get its current number of occurrences, and increment it by one. If it doesn't, insert the word with an occurrence count of zero. Here's a code snippet:

    HashMap<String, Integer> occurenceMap = new HashMap<>();

    for (String word : document) {
        Integer wordOccurences = occurenceMap.get(word);
        if (wordOccurences == null) {
            wordOccurences = Integer.valueOf(1);
        } else {
            wordOccurences += 1;
        }
        occurenceMap.put(word, wordOccurences);
    }
Malt
  • 28,965
  • 9
  • 65
  • 105
  • Yes, I will do this later on, but first I need to know what the occurrences are. Right now, I have access to ALL of the words from the documents, and they are saved in an ArrayList - as they are with duplicates. Can you explain how I go from that? – user1765804 Jan 30 '15 at 00:47
  • @user1765804 If you insist on the array list (seems wasteful to store the entire document if there's really no need to store duplicates), then just iterate over the array list and use the map to count the occurrence of each word as I explained. – Malt Jan 30 '15 at 00:50
  • I open to doing something else, but I guess I just didn't fully understand what you meant by your first comment. I need the duplication because that's how I know which words are more frequent. I am iterating over the document and adding each word along the way. What should I do at that step instead? – user1765804 Jan 30 '15 at 00:52
  • Ok thank you, that was very helpful. I ended up doing something very similar to that! – user1765804 Jan 30 '15 at 01:22