1

I have this code which prints me a list of words sorted by keys (alphabetically) from counts, my ConcurrentHashMap which stores words as keys and their frequencies as values.

// Method to create a stopword list with the most frequent words from the lemmas key in the json file
   private static List<String> StopWordsFile(ConcurrentHashMap<String, String> lemmas) {

// counts stores each word and its frequency
       ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<String, Integer>();

// corpus is an array list for all the individual words
           ArrayList<String> corpus = new ArrayList<String>();

           for (Entry<String, String> entry : lemmas.entrySet()) {
               
               String line = entry.getValue().toLowerCase();               
               line = line.replaceAll("\\p{Punct}", " ");
               line = line.replaceAll("\\d+"," ");
               line = line.replaceAll("\\s+", " ");
               line = line.trim();
               String[] value = line.split(" ");

               List<String> words = new ArrayList<String>(Arrays.asList(value));
               corpus.addAll(words);

    }

           // count all the words in the corpus and store the words with each frequency i 
           //counts
           for (String word : corpus) {

               if (counts.keySet().contains(word)) {
                   counts.put(word, counts.get(word) + 1);

               } else {counts.put(word, 1);}
}
// Create a list to store all the words with their frequency and sort it by values.
           List<Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());         
           
           List<String> stopwordslist = new ArrayList<>(counts.keySet()); # this works but counts.values() gives an error
           Collections.sort(stopwordslist);
           System.out.println("List after sorting: " +stopwordslist);

So the output is:

List after sorting: [a, abruptly, absent, abstractmap, accept,...]

How can I sort them by values as well? when I use List stopwordslist = new ArrayList<>(counts.values());

I get an error,

- Cannot infer type arguments for ArrayList<>

I guess that is because ArrayList can store < String > but not <String,Integer> and it gets confused.

I have also tried to do it with a custom Comparator like so:

           Comparator<Entry<String, Integer>> valueComparator = new Comparator<Entry<String,Integer>>() {
               @Override
               public int compare(Entry<String, Integer> e1, Entry<String, Integer> e2) {
                   String v1 = e1.getValue();
                   String v2 = e2.getValue();
                   return v1.compareTo(v2);
               }
           };  
           
           
           List<Entry<String, Integer>> stopwordslist = new ArrayList<Entry<String, Integer>>();
           // sorting HashMap by values using comparator 
           Collections.sort(counts, valueComparator)

which gives me another error,

The method sort(List<T>, Comparator<? super T>) in the type Collections is not applicable for the arguments (ConcurrentHashMap<String,Integer>, Comparator<Map.Entry<String,Integer>>)

how can I sort my list by values?

my expected output is something like

[the, of, value, v, key, to, given, a, k, map, in, for, this, returns, if, is, super, null, specified, u, function, and, ...]
Bluetail
  • 1,093
  • 2
  • 13
  • 27
  • seems to me they are sorted by value. what output did you expect? – Stultuske Sep 01 '22 at 11:23
  • I can think of a more reproducible example but that output is wrong. it is sorted by keys. – Bluetail Sep 01 '22 at 11:24
  • 1
    So you want to sort your map by it's value? Maybe [Sort a Map by values](https://stackoverflow.com/questions/109383/sort-a-mapkey-value-by-values) has the answer you are looking for. – OH GOD SPIDERS Sep 01 '22 at 11:27
  • thanks OH GOD SPIDERS. I do not understand the answer with 986 votes - how would it work in my case? – Bluetail Sep 01 '22 at 11:30
  • 2
    It would work in your case the same as it works in any case: You just copy the class and/or method code and use it by passing your Map into it: `Map sortedMap = MapUtil.sortByValue(counts);` – OH GOD SPIDERS Sep 01 '22 at 11:43

1 Answers1

2

Let’s go through all the issues of your code

  1. Name conventions. Method names should start with a lowercase letter.

  2. Unnecessary use of ConcurrentHashMap. For a purely local use like within you method, an ordinary HashMap will do. For parameters, just use the Map interface, to allow the caller to use whatever Map implementation will fit.

  3. Unnecessarily iterating over the entrySet(). When you’re only interested in the values, you don’t need to use entrySet() and call getValue() on every entry; you can iterate over values() in the first place. Likewise, you would use keySet() when you’re interested in the keys only. Only iterate over entrySet() when you need key and value (or want to perform updates).

  4. Don’t replace pattern matches by spaces, to split by the spaces afterwards. Specify the (combined) pattern directly to split, i.e. line.split("[\\p{Punct}\\d\\s]+").

  5. Don’t use List<String> words = new ArrayList<String>(Arrays.asList(value)); unless you specifically need the features of an ArrayList. Otherwise, just use List<String> words = Arrays.asList(value);
    But when the only thing you’re doing with the list, is addAll to another collection, you can use Collections.addAll(corpus, value); without the List detour.

  6. Don’t use counts.keySet().contains(word) as you can simply use counts.containsKey(word). But you can simplify the entire

    if (counts.containsKey(word)) {
        counts.put(word, counts.get(word) + 1);
    } else {counts.put(word, 1);}
    

    to

    counts.merge(word, 1, Integer::sum);
    
  7. The points above yield

    ArrayList<String> corpus = new ArrayList<>();
    for(String line: lemmas.values()) {
        String[] value = line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+");
        Collections.addAll(corpus, value);
    }
    for (String word : corpus) {
        counts.merge(word, 1, Integer::sum);
    }
    

    But there is no point in performing two loops, the first only to store everything into a potentially large list, to iterate over it a single time. You can perform the second loop’s operation right in the first (resp. only) loop and get rid of the list.

    for(String line: lemmas.values()) {
        for(String word: line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }
    
  8. You already acknowledged that you can’t sort a map, by copying the map into a list and sorting the list in your first variant. In the second variant, you created a List<Entry<String, Integer>> but then, you didn’t use it at all but rather tried to pass the map to sort. (By the way, since Java 8, you can invoke sort directly on a List, no need to call Collections.sort).
    You have to keep copying the map data into a list and sorting the list. For example,

    List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());         
    list.sort(Map.Entry.comparingByValue());
    

    Now, you have to decide whether you change the return type to List<Map.Entry<String, Integer>> or copy the keys of the sorted entries to a new list.

Taking all points together and staying with the original return type, the fixed code looks like

private static List<String> stopWordsFile(Map<String, String> lemmas) {
    Map<String, Integer> counts = new HashMap<>();

    for(String line: lemmas.values()) {
        for(String word: line.toLowerCase().trim().split("[\\p{Punct}\\d\\s]+")) {
            counts.merge(word, 1, Integer::sum);
        }
    }

    List<Map.Entry<String, Integer>> list = new ArrayList<>(counts.entrySet());         
    list.sort(Map.Entry.comparingByValue());

    List<String> stopwordslist = new ArrayList<>();
    for(Map.Entry<String, Integer> e: list) stopwordslist.add(e.getKey());

//    System.out.println("List after sorting: " + stopwordslist);
    return stopwordslist;
}
Holger
  • 285,553
  • 42
  • 434
  • 765
  • wow, thank you very much for taking the time to write this for me! – Bluetail Sep 01 '22 at 16:44
  • the only issue is that I wanted them to be sorted starting with the most frequent words first and get the first 10 or 20 words. list.sort(Map.Entry.comparingByValue().reversed()) does not work? – Bluetail Sep 01 '22 at 17:04
  • 2
    You can replace `Map.Entry.comparingByValue()` with `Map.Entry.comparingByValue(Comparator.reverseOrder())` – Holger Sep 01 '22 at 17:05
  • yep, that worked! is there also a way to limit this list to the first 10 words? – Bluetail Sep 01 '22 at 17:09
  • 1
    Only after the sorting you know which ten elements are the most frequent. You may extract them like `List topTen = new ArrayList<>(stopwordslist.subList(0, Math.min(10, stopwordslist.size())));` – Holger Sep 01 '22 at 17:12
  • 1
    There are algorithms for selecting the top *n* elements from a collection without sorting, but they are more complicated and may only pay off when *n* is significantly smaller than the map (you will only know whether they truly pay off after you implemented both alternatives and compared). In fact, the operations used in the answer are so optimized under the hood that they may be faster than a theoretically superior alternative. – Holger Sep 01 '22 at 17:20
  • thank you so much! I was experimenting with 10% here, stopwordslist.subList(0, (int) (0.10 * list.size())) – Bluetail Sep 01 '22 at 17:20
  • 1
    Keep in mind that a `subList` keeps a reference to the original list, so this might be a case where copying the list into another list like `return new ArrayList<>(stopwordslist.subList(0, …));` is justified, to avoid holding memory occupied longer than necessary. – Holger Sep 01 '22 at 17:24