0

I'm looking to create a function that calculates the top occurring keywords in some text and returns the top 10 most common keywords.

So far I've been using the following code:

public static String KeywordDensityCalc(String articles )
{
    String[] body = articles.split(" ");
    HashMap<String, Integer> densityTracker = new HashMap<String, Integer>();

    for(int densityCounter = 0; densityCounter < body.length; densityCounter++)
    {

        if (!densityTracker.containsKey(body[densityCounter]))

                {
                  densityTracker.put(body[densityCounter], 1);

                } else
                {
                    densityTracker.put(body[densityCounter], densityTracker.get(body[densityCounter]) + 1);
                }


    }

This allows me to create a hashmap of string (as keys) and number of occurrences (as values). Now my question is how to sort them from highest to lowest and return the top 10 (or some other number)?

  • Check [How to sort a Map on the values in Java?](http://stackoverflow.com/questions/109383/how-to-sort-a-mapkey-value-on-the-values-in-java). – takendarkk Aug 26 '14 at 01:58

2 Answers2

0

You actually don't need to sort the whole map which would take O(nlogn) time.

Create a min heap, add to it the first 10 elements from the map. Iterate over the remaining entries in the map and compare the min value from the heap with current entry count. If the current entry is bigger then pop the minimum element from the heap and push the current entry. This will give you a min heap with 10 top occurrences in O(nlog10) time (or well O(nlogk) where k is the number of top occurrences you want).

Mateusz Dymczyk
  • 14,969
  • 10
  • 59
  • 94
0

Assuming you don't really care about complexity (I presume your "n" is quite small) then you could do something like the following (I have written it in the form of a unit test):

package blah.blah.blah;

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

import org.junit.Test;

public class TestTreeMap {

    @Test
    public void testTreeMap() {
        final HashMap<String, Integer> frequencies = new HashMap<String, Integer>();
        frequencies.put("x", 3);
        frequencies.put("y", 1);
        frequencies.put("z", 2);
        final TreeMap<String, Integer> map = new TreeMap<String, Integer>(new KeyComparator(frequencies));
        map.putAll(frequencies);
        final Set<String> keySet = map.keySet();
        System.out.println(keySet);
    }

    public static class KeyComparator implements Comparator<String> {

        public KeyComparator(final Map<String, Integer> map) {
            super();
            this.map = map;
        }

        public int compare(final String o1, final String o2) {
            return this.map.get(o1).compareTo(this.map.get(o2));
        }

        private final Map<String, Integer> map;
    }
}

Obviously, you don't need to add the complication of the unit test. Just create the TreeMap with the given comparator and off you go. In my case, I get the result [y, z, x] which is what I would expect. Essentially, you are decorating the frequency map with a map that can sort the keys according to their integer value. Since the TreeMap is backed by the frequency map, you can call your puts at any time, you don't have to do them before you create the TreeMap.

Phasmid
  • 923
  • 7
  • 19