23

Specifically I need a collection which uses one field A for accessing and a different one (field S) for sorting but a sorted collection which accepts duplicate would be sufficient.

I often come to this point where I need exactly this collection and TreeMap is not an option as it does not allow duplicates. So now it is time to ask here. There are several workarounds as pointed out on stackoverflow here and here - namely there are:

  • PriorityQueue: slow update (remove(Object) + add(Object)), and boxing of primitive keys
  • Fibonacci heap: memory waste (?)
  • TreeMap<Field_S, List<Value>>: problem for me is the memory overhead of the list, and boxing of primitive keys
  • sorted list or array: problem is the slow insert and remove -> should I implement one segmented sorted list?
  • TreeMultimap from guava (docs): external dependency and probably memory inefficient (?)

Anyone with better suggestions? Or should I role my own sorted datastructure (which one?)? Also other sources (in Java, open source, with unit tests and small deps) would be nice.


Update

More details on my use case at the moment (although I'm having similar demand in the last time). I have a collection (with millions) of references where I want to be able

  • to poll or get the smallest element regarding field S
  • and update field S with the help of field A
  • identical values of field S can happen. field A is actually a integer pointing into another array
  • the only dependency I want is trove4j. I could use a different like the mahout collections if that would be required. But not guava as although a nice lib the collections are not tuned to be memory efficient (boxing/unboxing).

So all cries for a fibonacci heap but I fear it has too many overhead per element -> that was the reason I thought about a more memory efficient "sorted+segmented array" solution.

Community
  • 1
  • 1
Karussell
  • 17,085
  • 16
  • 97
  • 197

6 Answers6

7

When you need a sorted collection, you should analyze your needs carefully.
If the majority of operations is inserting and only a few are to search then using a sorted collection i.e. keep the elements sorted in the collection constantly, would not be a good option (due to the overhead of keeping the elements sorted on insert which would be the most common operation).
In this case it would be best to keep an unsorted collection and do the sorting only when needed. I.e. before the search. You could even use a simple List and sort it (using Collections.sort i.e. mergesort) when needed. But I recommend this with caution, as for this to be efficient the assumption is that you work on large data. In really small data even linear search is good enough.

If the majority of operations is searching then you could use a sorted collection which from my of point of view there are data structures to choose from (some you already mention) and you could benchmark to see which one fits your needs.

Cratylus
  • 52,998
  • 69
  • 209
  • 339
3

What about guava TreeMultiset? What you asked for: a sorted collection which accepts duplicates. Don't know anything about its performance though.

Santosh Joshi
  • 3,290
  • 5
  • 36
  • 49
vainolo
  • 6,907
  • 4
  • 24
  • 47
  • I've already added it but I think (didn't look at the code yet) that the implementation is nearly identical to Map>, but the main problem is the dependency – Karussell Oct 10 '12 at 21:22
  • Seems from the code to be a completely new implementation. Lots and lots of code. And you can download the source and add it to your project, so what is the problem? licensing? – vainolo Oct 10 '12 at 21:28
  • jar size. my app should be small and portable. – Karussell Oct 11 '12 at 07:10
  • What about using tools like [proguard](http://proguard.sourceforge.net) to compress the jar after you finish programming? writing your own library just because of jar size is not a good investment of your time. – vainolo Oct 11 '12 at 07:38
  • well, I would like to keep my (open source) project clean and with only a few dependencies. and only because of one datastructure I definitely won't depend on a whole library. I already have trove4j and do not want yet another collection lib – Karussell Oct 11 '12 at 09:53
  • BTW: a treemultiset is just a treemap with an associated counter. so 1. it won't work in my case (the multimap proposed by jakup is the right one) and 2. I could easily implement this on my own (I don't see any optimization code which would improve my case) – Karussell Oct 11 '12 at 10:00
2

I decided to roll my own but not the optimal solution just a TreeMap variant. I'll keep this updated if I'll fine tune this collection regarding memory. Speed is already a lot better then the previous PriorityQueue attempt as I needed the collection.remove(Object) method (for updating an entry):

package com.graphhopper.coll;

import gnu.trove.iterator.TIntIterator;
import gnu.trove.set.hash.TIntHashSet;
import java.util.Map.Entry;
import java.util.TreeMap;

/**
 * A priority queue implemented by a treemap to allow fast key update. Or should we use a standard
 * b-tree?
 */
public class MySortedCollection {

    private int size;
    private int slidingMeanValue = 20;
    private TreeMap<Integer, TIntHashSet> map;

    public MySortedCollection(int size) {
        map = new TreeMap<Integer, TIntHashSet>();
    }

    void remove(int key, int value) {
        TIntHashSet set = map.get(value);
        if (set == null || !set.remove(key))
            throw new IllegalStateException("cannot remove key " + key + " with value " + value
                    + " - did you insert " + key + "," + value + " before?");
        size--;
        if (set.isEmpty())
            map.remove(value);
    }

    public void update(int key, int oldValue, int value) {
        remove(key, oldValue);
        insert(key, value);
    }

    public void insert(int key, int value) {
        TIntHashSet set = map.get(value);
        if (set == null)
            map.put(value, set = new TIntHashSet(slidingMeanValue));
//        else
//            slidingMeanValue = Math.max(5, (slidingMeanValue + set.size()) / 2);
        if (!set.add(key))
            throw new IllegalStateException("use update if you want to update " + key);
        size++;
    }

    public int peekValue() {
        if (size == 0)
            throw new IllegalStateException("collection is already empty!?");
        Entry<Integer, TIntHashSet> e = map.firstEntry();
        if (e.getValue().isEmpty())
            throw new IllegalStateException("internal set is already empty!?");
        return map.firstEntry().getKey();
    }

    public int peekKey() {
        if (size == 0)
            throw new IllegalStateException("collection is already empty!?");
        TIntHashSet set = map.firstEntry().getValue();
        if (set.isEmpty())
            throw new IllegalStateException("internal set is already empty!?");
        return set.iterator().next();
    }

    public int pollKey() {
        size--;
        if (size < 0)
            throw new IllegalStateException("collection is already empty!?");
        Entry<Integer, TIntHashSet> e = map.firstEntry();
        TIntHashSet set = e.getValue();
        TIntIterator iter = set.iterator();
        if (set.isEmpty())
            throw new IllegalStateException("internal set is already empty!?");
        int val = iter.next();
        iter.remove();
        if (set.isEmpty())
            map.remove(e.getKey());
        return val;
    }

    public int size() {
        return size;
    }

    public boolean isEmpty() {
        return size == 0;
    }

    public int getSlidingMeanValue() {
        return slidingMeanValue;
    }

    @Override
    public String toString() {
        return "size " + size + " min=(" + peekKey() + "=>" + peekValue() + ")";
    }
}
Karussell
  • 17,085
  • 16
  • 97
  • 197
1

You need to decide if you want external dependencies or not. I wouldn't roll my own implementation for something like this.

That said, you've told us almost nothing about what you're using this for, and what you plan to do with it. Without enough data, there's only so much we can tell you -- do you actually need to access the elements in random order? How large do you expect this collection to be? We really don't have enough data to pick out the one right data structure for your needs.

That said, here are some options I would consider.

  • ArrayList or PriorityQueue, depending on whether or not you actually need to support remove(Object). Do you? Are you sure? (Even if you do need to support remove(Object), I would choose this option if the collection is likely to stay small.)
  • Not the TreeList you linked to, but instead the Apache Commons Collections TreeList. Despite the name, it doesn't actually maintain sorted order, but what it does is support O(log n) add, remove, and get from anywhere in the list. Using binary search, you could potentially achieve O((log n)^2) time for add, remove, or lookup according to the sorted part of your values.
  • The TreeList you linked to, or -- if you're like me, and care about the List contract -- a custom Guava ListMultimap, obtained with Multimaps.newListMultimap(new TreeMap<K, Collection<V>>, new Supplier<List<V>>() { public List<V> get() { return new ArrayList<V>(); }}).

If you also care about primitive boxing, or can't tolerate third-party dependencies, you're going to have no choice but to write up your own data structure. I'd just adapt one of the implementations above to your primitive type, but this is going to be a royal pain.

Finally: I'd really like to hear your use case. Guava doesn't have any support for things like this because we haven't had enough demand, or seen a use case for which a more sophisticated data structure really appropriate.

Karussell
  • 17,085
  • 16
  • 97
  • 197
Louis Wasserman
  • 191,574
  • 25
  • 345
  • 413
1

I would go with skiplist - more memory efficient than a tree, allows duplicates, provides O(logn) for inserts and deletes. You can even implement an indexed skiplist, it will allow you to have indexed access, something that's hard to get with a tree.

vladich
  • 568
  • 6
  • 9
0

I have good experience with TreeMultimap https://guava.dev/releases/19.0/api/docs/com/google/common/collect/TreeMultimap.html

Santosh Joshi
  • 3,290
  • 5
  • 36
  • 49
jakub.petr
  • 2,951
  • 2
  • 23
  • 34