27

I need to have an automatically sorted-by-values map in Java - so that It keeps being sorted at any time while I'm adding new key-value pairs or update the value of an existing key-value pair, or even delete some entry.

Please also have in mind that this map is going to be really big (100's of thousands, or even 10's of millions of entries in size).

So basically I'm looking for the following functionality:

Supposed that we had a class 'SortedByValuesMap' that implements the aforementioned functionality and we have the following code:

SortedByValuesMap<String,Long> sorted_map = new SortedByValuesMap<String, Long>();
sorted_map.put("apples", 4);
sorted_map.put("oranges", 2);
sorted_map.put("bananas", 1);
sorted_map.put("lemons", 3);
sorted_map.put("bananas", 6);

for (String key : sorted_map.keySet()) {
  System.out.println(key + ":" + sorted_map.get(key));
}

the output should be:

bananas:6
apples:4
lemons:3
oranges:2

In particular, what's really important for me, is to be able to get the entry with the lowest value at any time - using a command like:

smallestItem = sorted_map.lastEntry();

which should give me the 'oranges' entry

EDIT: I am a Java newbie so please elaborate a bit in your answers - thanks

EDIT2: This might help: I am using this for counting words (for those who are familiar: n-grams in particular) in huge text files. So I need to build a map where keys are words and values are the frequencies of those words. However, due to limitations (like RAM), I want to keep only the X most frequent words - but you can't know beforehand which are going to be the most frequent words of course. So, the way I thought it might work (as an approximation) is to start counting words and when the map reaches a top-limit (like 1 mil entries) , the least frequent entry will be deleted so as to keep the map's size to 1 mil always.

Mechanical snail
  • 29,755
  • 14
  • 88
  • 113
Alexandros
  • 4,425
  • 4
  • 23
  • 21
  • 1
    millions of entries? why not use a database for that? – Kru Sep 19 '11 at 00:29
  • What if there were two keys with identical lowest-values? What should be the expected behaviour of `lastEntry()`? (E.G. another entry of `limes` -> `2` was in the map) – Peter Sep 19 '11 at 00:31
  • 1
    @Kru: a database would make it really slow – Alexandros Sep 19 '11 at 00:36
  • @Peter returning any of those (limes or oranges) would be fine for me – Alexandros Sep 19 '11 at 00:36
  • Will the contents of the map be known before you want to use it, or will you want to modify it on the fly? – Timothy Jones Sep 19 '11 at 00:52
  • @Timothy Jones it needs to be modified on the fly – Alexandros Sep 19 '11 at 00:58
  • OK. What sort of modifications? Do you expect something like `put("oranges",3)` followed later by `put("oranges",12)`, meaning that the internal ordering needs to change? – Timothy Jones Sep 19 '11 at 00:59
  • @Timothy Jones exactly. Also, there might be a remove("oranges") at some point, which might also change the overall ordering. Please check my second edit which explains the reason for needing this – Alexandros Sep 19 '11 at 01:11
  • 2
    If this is just english, you're over-estimating how many words there are, particularly that are commonly used. – Dave Newton Sep 19 '11 at 01:15
  • Awesome, thanks. That makes it much clearer. I have to head off now, but if your question is still unanswered by the (Australian) evening, I'll come back and hopefully write you up an answer :) – Timothy Jones Sep 19 '11 at 01:15
  • 1
    @Dave Newton you're right - I mentioned words so as not to confuse people who are unfamiliar with n-grams, which are what I am actually counting. N-grams, especially as N goes up, can become really diverse. The possible combinations go up exponentially. – Alexandros Sep 19 '11 at 01:19
  • @Dave Newton: I disagree. Depending on the size of the data set, I think this estimate is realistic. For example, In the first 50 million English pages of this collection http://lemurproject.org/clueweb09.php/ there are around 80 million words. – Timothy Jones Sep 19 '11 at 01:20
  • Oh, gotcha. IMO I'd still just try with two structures, a map, and a sorted list. – Dave Newton Sep 19 '11 at 01:22
  • @Timothy, There are around a quarter-million words defined in the OED, which includes tens of thousands of archaic and words rarely used. While it obviously depends on the corpus, there just aren't that many words in common usage. – Dave Newton Sep 19 '11 at 01:25
  • (I'm baffled how a count of 80 million English words even makes _sense_, or how that figure was derived.) – Dave Newton Sep 19 '11 at 01:29
  • @Dave Newton can you please elaborate a bit on the map and the sorted list solution you are suggesting? I am kinda new in java :\ – Alexandros Sep 19 '11 at 01:34
  • A map of word/count values is kept, a sorted list (sorted on count) is kept, with each entry containing the count, and the word. Lots more memory, but pretty fast. I've used this for markov chaining of english, and music. I'm sure there are more elegant solutions (probably faster, too) but there it is :) – Dave Newton Sep 19 '11 at 01:39
  • @Dave Newton: I agree with your figure for dictionary words, but there will be many more terms if we include misspellings and terms that aren't words (eg numbers, product names, acronyms etc)- which most large text samples will have lots of. I don't think there are 80 million *English* terms in the collection I linked to - but I do think there are around 80 million *distinct* terms in that English corpus :) Of course, if the OP is working with a 'clean' data set, there will be far fewer terms. – Timothy Jones Sep 19 '11 at 01:51
  • @Timothy You said words, so that's what I addressed :) – Dave Newton Sep 19 '11 at 01:54
  • possible duplicate of [TreeMap sort by value](http://stackoverflow.com/questions/2864840/treemap-sort-by-value) – BalusC Sep 19 '11 at 03:52
  • @BalusC: I don't think that's quite the same question - the accepted solution to the other question does a full sort at sort time, while this question is about keeping the `TreeMap` sorted (and thus iterable) all the time. – Timothy Jones Sep 19 '11 at 08:50
  • @Alexandros: As promised, I've returned with some code for you :) – Timothy Jones Sep 19 '11 at 08:50

8 Answers8

4

Keep 2 data structures:

  • A dictionary of words -> count. Just use an ordinary HashMap<String, Long>.
  • An "array" to keep track of order, such that list[count] holds a Set<String> of words with that count.

    I'm writing this as though it were an array as a notational convenience. In fact, you probably don't know an upper bound on the number of occurrences, so you need a resizable data structure. Implement using a Map<Long, Set<String>>. Or, if that uses too much memory, use an ArrayList<Set<String>> (you'll have to test for count == size() - 1, and if so, use add() instead of set(count + 1)).

To increment the number of occurrences for a word (pseudocode):

// assumes data structures are in instance variables dict and arr
public void tally(final String word)
{
    final long count = this.dict.get(word) or 0 if absent;
    this.dict.put(word, count + 1);
    // move word up one place in arr
    this.arr[count].remove(word);   // This is why we use a Set: for fast deletion here.
    this.arr[count + 1].add(word);
}

To iterate over words in order (pseudocode):

for(int count = 0; count < arr.size; count++)
    for(final String word : this.arr[count])
        process(word, count);
Mechanical snail
  • 29,755
  • 14
  • 88
  • 113
2

How about using additional index or only TreeMap<Long, TreeSet<String>> or TreeMap<Long, String> if Long values are distinct?

You can also write a Heap.

Ingo
  • 1,552
  • 10
  • 31
NiematojakTomasz
  • 2,433
  • 20
  • 23
  • Long values are not distinct. Two different entries might have the same Long values - Long values actually represent frequencies – Alexandros Sep 19 '11 at 00:33
  • So you can use `TreeMap>`. – NiematojakTomasz Sep 19 '11 at 00:40
  • that might work but I'm afraid it would double the amount of time since we're doubling the map operations - and in my case, where I have millions of entries that might make a huge difference – Alexandros Sep 19 '11 at 00:50
  • Not that much. Just constant factor will slightly rise. You can also create some pair class like `Map.Entry` and use `TreeSet>`. – NiematojakTomasz Sep 19 '11 at 00:52
  • Sorry to be a nay sayer (I commented on the other answer too), but the `TreeMap>` approach also means that you'd need to know the `Long` that your `String` mapped to before performing lookup - and if you did, then you wouldn't need to perform the lookup anyway... – Timothy Jones Sep 19 '11 at 00:55
  • 1
    Yes, but you can keep both `TreeMap>` and `Map`. I guess there is no provided single data structure in java that do both tricks. In SQL table you would want to have indexes on two columns, so I guess you also need 2 "indexes" in java. – NiematojakTomasz Sep 19 '11 at 01:02
1

Try the solution posted on http://paaloliver.wordpress.com/2006/01/24/sorting-maps-in-java/ . You have the flexibility of doing sorting ascending or descending too.

Here is what they say

import java.util.Comparator;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MapValueSort {

    /** inner class to do soring of the map **/
    private static class ValueComparer implements Comparator<String> {
        private Map<String, String>  _data = null;
        public ValueComparer (Map<String, String> data){
            super();
            _data = data;
        }

         public int compare(String o1, String o2) {
             String e1 = (String) _data.get(o1);
             String e2 = (String) _data.get(o2);
             return e1.compareTo(e2);
         }
    }

    public static void main(String[] args){

        Map<String, String> unsortedData = new HashMap<String, String>();
        unsortedData.put("2", "DEF");
        unsortedData.put("1", "ABC");
        unsortedData.put("4", "ZXY");
        unsortedData.put("3", "BCD");


        SortedMap<String, String> sortedData = new TreeMap<String, String>(new MapValueSort.ValueComparer(unsortedData));

        printMap(unsortedData);

        sortedData.putAll(unsortedData);
        System.out.println();
        printMap(sortedData);
    }

    private static void printMap(Map<String, String> data) {
        for (Iterator<String> iter = data.keySet().iterator(); iter.hasNext();) {
            String key = (String) iter.next();
            System.out.println("Value/key:"+data.get(key)+"/"+key);
        }
    }

}

Outputs

Value/key:BCD/3
Value/key:DEF/2
Value/key:ABC/1
Value/key:ZXY/4

Value/key:ABC/1
Value/key:BCD/3
Value/key:DEF/2
Value/key:ZXY/4
1

I found the need of a similar structure to keep a list of objects ordered by associated values. Based on the suggestion from Mechanical snail in this thread, I coded up a basic implementation of such a map. Feel free to use.

import java.util.*;

/**
 * A map where {@link #keySet()} and {@link #entrySet()} return sets ordered
 * with ascending associated values with respect to the the comparator provided
 * at constuction. The order of two or more keys with identical values is not
 * defined.
 * <p>
 * Several contracts of the Map interface are not satisfied by this minimal
 * implementation.
 */
public class ValueSortedMap<K, V> extends HashMap<K, V> {
    protected Map<V, Collection<K>> valueToKeysMap;

    public ValueSortedMap() {
        this((Comparator<? super V>) null);
    }

    public ValueSortedMap(Comparator<? super V> valueComparator) {
        this.valueToKeysMap = new TreeMap<V, Collection<K>>(valueComparator);
    }

    public boolean containsValue(Object o) {
        return valueToKeysMap.containsKey(o);
    }

    public V put(K k, V v) {
        V oldV = null;
        if (containsKey(k)) {
            oldV = get(k);
            valueToKeysMap.get(oldV).remove(k);
        }
        super.put(k, v);
        if (!valueToKeysMap.containsKey(v)) {
            Collection<K> keys = new ArrayList<K>();
            keys.add(k);
            valueToKeysMap.put(v, keys);
        } else {
            valueToKeysMap.get(v).add(k);
        }
        return oldV;
    }

    public void putAll(Map<? extends K, ? extends V> m) {
        for (Map.Entry<? extends K, ? extends V> e : m.entrySet())
            put(e.getKey(), e.getValue());
    }

    public V remove(Object k) {
        V oldV = null;
        if (containsKey(k)) {
            oldV = get(k);
            super.remove(k);
            valueToKeysMap.get(oldV).remove(k);
        }
        return oldV;
    }

    public void clear() {
        super.clear();
        valueToKeysMap.clear();
    }

    public Set<K> keySet() {
        LinkedHashSet<K> ret = new LinkedHashSet<K>(size());
        for (V v : valueToKeysMap.keySet()) {
            Collection<K> keys = valueToKeysMap.get(v);
            ret.addAll(keys);
        }
        return ret;
    }

    public Set<Map.Entry<K, V>> entrySet() {
        LinkedHashSet<Map.Entry<K, V>> ret = new LinkedHashSet<Map.Entry<K, V>>(size());
        for (Collection<K> keys : valueToKeysMap.values()) {
            for (final K k : keys) {
                final V v = get(k);
                ret.add(new Map.Entry<K,V>() {
                    public K getKey() {
                        return k;
                    }

                    public V getValue() {
                        return v;
                    }

                    public V setValue(V v) {
                        throw new UnsupportedOperationException();
                    }
                });
            }
        }
        return ret;
    }
}

This implementation does not honor all the contracts of the Map interface such as reflecting value changes and removals in the returned key set and entry sets in the actual map, but such a solution would be a bit large to include in a forum like this. Perhaps I will work on one and make it available via github or something similar.

1

Guava BiMap Solution:

//Prepare original data
BiMap<String, Integer> biMap = HashBiMap.create();
biMap.put("apples" , 4);
biMap.put("oranges", 2);
biMap.put("bananas", 1);
biMap.put("lemons" , 3);
biMap.put("bananas", 6);

//Create a desc order SortedMap
SortedMap<Integer, String> sortedMap = new TreeMap<Integer, String>(new Comparator<Integer>(){
    @Override public int compare(Integer o1, Integer o2) {
      return o2-o1;
}});

//Put inversed map
sortedMap.putAll(biMap.inverse());
for (Map.Entry<Integer, String> e: sortedMap.entrySet()) {
      System.out.println(e);
}
System.out.println(sortedMap.lastKey()); 
卢声远 Shengyuan Lu
  • 31,208
  • 22
  • 85
  • 130
0

Update: You cannot sort maps by values, sorry.

You can use SortedMap implementation like TreeMap with Comparator defining order by values (instead of default - by keys).

Or, even better, you can put elements into a PriorityQueue with predefined comparator by values. It should be faster and take less memory compared to TreeMap.

Michał Šrajer
  • 30,364
  • 7
  • 62
  • 85
  • can you please provide an example on how to do this? – Alexandros Sep 19 '11 at 00:37
  • I don't think you can use a priority queue, as the documentation says that the iterator isn't guaranteed to traverse the queue in any particular order. – Timothy Jones Sep 19 '11 at 00:40
  • @Timothy Jones: This is why I suggest to use PriorityQueue as an alternative (if possible). I didn't make it clear. Thanks for pointing it out. – Michał Šrajer Sep 19 '11 at 00:46
  • if I use a TreeMap that orders items by Values, would accessing an item by Key be also fast? – Alexandros Sep 19 '11 at 00:53
  • @Alexandros using the Java TreeMap implementation, getting a value by key is done in log2(n) time, due to the tree structure. I don't know if that's fast _enough_ for you, but it's not constant time. – Peter Sep 19 '11 at 00:55
  • 2
    to be able to order your TreeMap by value, your keys would need to contain the values as well. in which case, you would have a hard time looking up values by key... – jtahlborn Sep 19 '11 at 01:06
0

You may refer to the implementation of java.util.LinkedHashMap. The basic idea is, using a inner linked list to store orders. Here is some details:

Extends from HashMap. In HashMap, each entry has a key and value, that is basic. You can Add a next and a prev pointer to store entries in order by value. And a header and tail pointer to get the first and last entry. For every modification (add, remove, update), you can add your own code to change the list order. It is no more than a linear search and pointer switch.

Sure it will be slow for add/update if there are too many entries because it is a linked list not array. But as long as the list is sorted, I believe there are lots of ways to speedup the search.

So here is what you got: A map that has the same speed with HashMap when retrieving an entry by a key. A linked list which stores entries in order.

We can discuss this further if this solution meets your requirement.


to jtahlborn: As I said, it surely is slow without any optimization. Since we are talking about performance not impl now, lots of things can be done.

One solution is using a tree instead of Linked List, like Red-Black Tree. Then iterate the tree instead of iterator the map.

About the smallest value, it is easier. Just using a member variable to store the smallest, when add or update an element, update the smallest value. When delete, search the tree for the smallest (this is very fast)

if tree is too complex, it is also possible to using another list/array to mark the some positions in the list. for example, maybe 100 element each. Then when search, just search the position list first and then the real list. This list also needs to be maintained, it would be reasonable to recount the position list for certain times of modification, maybe 100.

DeepNightTwo
  • 4,809
  • 8
  • 46
  • 60
  • the OP indicates using a collection with potentially 10s of millions of entries. updating a "sorted" linked list with this many entries would be abysmally slow. – jtahlborn Sep 19 '11 at 04:08
-1

if all you need is the "min" value, then just use a normal map and keep track of the "min" value anytime it is modified.

EDIT:

so, if you really need value ordering and you want to use out-of-the-box solutions, you basically need 2 collections. One normal map (e.g. HashMap), and one SortedSet (e.g. TreeSet>). you can traverse ordered elements via the TreeSet, and find frequencies by key using the HashMap.

obviously, you could always code up something yourself sort of like a LinkedHashMap, where the elements are locatable by key and traversable by order, but that's pretty much going to be entirely custom code (i doubt anything that specific already exists, but i could be wrong).

jtahlborn
  • 52,909
  • 5
  • 76
  • 118
  • because, during the process, at some point I might want to remove the item with the min value. After removing that item, I need to know the next item with the min value. Kinda like the weakest link. – Alexandros Sep 19 '11 at 01:14
  • why the downvote? @Timothy Jones basically wrote up my suggestion as the chosen answer. – jtahlborn Sep 20 '11 at 11:32