
So because Javolution does not work (see here), I am in deep need of a Java Map implementation that is efficient and produces no garbage under simple usage. The standard java.util.Map implementations produce garbage as you add and remove keys. I checked Trove and Guava, but it does not look like they have Set<E> implementations. Where can I find a simple and efficient alternative to java.util.Map?

Edit for EJP:

An entry object is allocated when you add an entry, and released to GC when you remove it. :(

    // java.util.HashMap.addEntry (JDK 6/7): allocates a new Entry on every insert
    void addEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }
JohnPristine
  • The Entry object is not released until you remove the entry again, so it is hardly "garbage" (it is an essential part of the hash table). And even if it was, why is this a problem? What is next? Getting rid of primitive wrappers? – Thilo Mar 22 '12 at 02:32
  • That's what I said. If you add and remove an object you are creating garbage for the GC. This is only a problem when you are working with real-time systems, that cannot afford GC latency. Javolution was done for real-time systems, but it does not work (great!). I do agree with you. 99% of the time this is not a problem. But for a real-time system it IS a problem unfortunately. – JohnPristine Mar 22 '12 at 02:34
  • What kind of keys and values do you have? Does it need to work with any object? – Thilo Mar 22 '12 at 02:35
  • Map. No primitives. :( It looks like Trove only supports primitives, which is very useful for some cases, but not this one. :( – JohnPristine Mar 22 '12 at 02:36
  • It seems your problem with that other library was resizing as the map grows. All libraries will probably have to do that. Pre-sizing to a maximum capacity upfront is not an option? – Thilo Mar 22 '12 at 02:38
  • Why do you want to use java in an environment where running GC is a problem? – Attila Mar 22 '12 at 02:39
  • @Thilo The size is NEVER greater than one. So it is not an oversize problem. Check my reply to you in the other question. – JohnPristine Mar 22 '12 at 02:40
  • Do you have a good reason for this need? Why do you think a HashMap, for example, is not suited for real-time systems? Have you tried to use it? Have you really measured its performance and found it unacceptable? If you are really that serious about time and memory usage, why not write it in plain C? It will then be fast as lightning and won't use any resources that a JVM consumes. – Jakub Zaverka Mar 22 '12 at 02:41
  • Okay, so that was a bug. Will your real map only also have one (or very few) entries? If so, there are specialized implementations for that. – Thilo Mar 22 '12 at 02:42
  • @Attila That deserves a topic of its own. You can choose between C++, RTSJ and Java. We chose Java, sorry. – JohnPristine Mar 22 '12 at 02:42
  • 1
    Seems like a very bad choice if even the performance of very basic data types is not sufficient for your needs. – Niklas B. Mar 22 '12 at 02:56

4 Answers


Taken literally, I am not aware of any existing implementation of Map or Set that never produces any garbage on adding and removing a key.

In fact, the only way that it would even be technically possible (in Java, using the Map and Set APIs as defined) is if you were to place a strict upper bound on the number of entries. Practical Map and Set implementations need extra state proportional to the number of elements they hold. This state has to be stored somewhere, and when the current allocation is exceeded that storage needs to be expanded. In Java, that means that new nodes need to be allocated.

(OK, you could design a data structure class that holds onto old useless nodes forever, and therefore never generates any collectable garbage ... but it is still generating garbage.)


So what can you do about this in practice to reduce the amount of garbage generated? Let's take HashMap as an example:

  • Garbage is created when you remove an entry. This is unavoidable, unless you replace the hash chains with an implementation that never releases the nodes that represent the chain entries. (And that's a bad idea ... unless you can guarantee that the free node pool size will always be small. See below for why it is a bad idea.)

  • Garbage is created when the main hash array is resized. This can be avoided in a couple of ways:

    • You can give a 'capacity' argument in the HashMap constructor to set the size of the initial hash array large enough that you never need to resize it. (But that potentially wastes space ... especially if you can't accurately predict how big the HashMap is going to grow.)

    • You can supply a ridiculous value for the 'load factor' argument to cause the HashMap to never resize itself. (But that results in a HashMap whose hash chains are unbounded, and you end up with O(N) behaviour for lookup, insertion, deletion, etc.)
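For illustration, the presizing option might look like this (the entry count here is made up; you would pick one from your own workload, and HashMap will round the capacity up to the next power of two internally):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        // Assumption for this sketch: we never store more than ~48k entries.
        // Resizing triggers at (capacity * loadFactor), so size the table
        // large enough that the threshold is never reached.
        int expectedMaxEntries = 48_000;
        int capacity = (int) Math.ceil(expectedMaxEntries / 0.75);
        Map<String, Integer> map = new HashMap<>(capacity, 0.75f);

        map.put("a", 1);
        System.out.println(map.get("a")); // prints 1; no resize will ever occur
    }
}
```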


In fact, creating garbage is not necessarily bad for performance. Indeed, hanging onto nodes so that the garbage collector doesn't collect them can actually be worse for performance.

The cost of a GC run (assuming a modern copying collector) is mostly in three areas:

  • Finding nodes that are not garbage.
  • Copying those non-garbage nodes to the "to-space".
  • Updating references in other non-garbage nodes to point to objects in "to-space".

(If you are using a low-pause collector there are other costs too ... generally proportional to the amount of non-garbage.)

The only part of the GC's work that actually depends on the amount of garbage, is zeroing the memory that the garbage objects once occupied to make it ready for reuse. And this can be done with a single bzero call for the entire "from-space" ... or using virtual memory tricks.

Suppose your application / data structure hangs onto nodes to avoid creating garbage. Now, when the GC runs, it has to do extra work to traverse all of those extra nodes, and copy them to "to-space", even though they contain no useful information. Furthermore, those nodes are using memory, which means that if the rest of the application generates garbage there will be less space to hold it, and the GC will need to run more often.

And if you've used weak/soft references to allow the GC to claw back nodes from your data structure, then that's even more work for the GC ... and space to represent those references.

Note: I'm not claiming that object pooling always makes performance worse, just that it often does, especially if the pool gets unexpectedly big.

And of course, that's why HashMap and similar general purpose data structure classes don't do any object pooling. If they did, they would perform significantly worse in situations where the programmer doesn't expect it ... and they would be genuinely broken, IMO.


Finally, there is an easy way to tune a HashMap so that an add immediately followed by a remove of the same key produces no garbage (guaranteed). Wrap it in a Map class that caches the last entry "added", and only does the put on the real HashMap when the next entry is added. Of course, this is NOT a general purpose solution, but it does address the use case of your earlier question.
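A rough sketch of that wrapper idea (the class name and fields are invented for illustration; a real version would implement the full Map interface):

```java
import java.util.HashMap;

// Sketch: defer each put, so that a put(k, v) immediately followed by a
// remove(k) cancels out without ever touching the backing HashMap, and
// therefore without allocating an Entry node.
class LastEntryCachingMap<K, V> {
    private final HashMap<K, V> backing = new HashMap<>();
    private K pendingKey;
    private V pendingValue;

    public void put(K key, V value) {
        flush();              // push any previously cached entry down first
        pendingKey = key;     // cache this entry instead of storing it
        pendingValue = value;
    }

    public V get(K key) {
        if (pendingKey != null && pendingKey.equals(key)) {
            return pendingValue;
        }
        return backing.get(key);
    }

    public V remove(K key) {
        if (pendingKey != null && pendingKey.equals(key)) {
            V v = pendingValue;   // cancel the pending put: no garbage made
            pendingKey = null;
            pendingValue = null;
            return v;
        }
        return backing.remove(key);
    }

    private void flush() {
        if (pendingKey != null) {
            backing.put(pendingKey, pendingValue);  // may allocate an Entry
            pendingKey = null;
            pendingValue = null;
        }
    }
}
```

The add-then-remove sequence from the earlier question stays entirely in the two cache fields, so the backing map is never modified.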

Stephen C
  • I don't agree it is a bad idea to pool those entries, especially if you will be adding and removing a bunch of elements without ever having a large map. And it is not extra work; you just add and remove from a pool, probably the same as instantiating and releasing. – chrisapotek Mar 22 '12 at 03:45
  • Your statement "because you make more work for the garbage collector" is wrong => http://stackoverflow.com/questions/9005199/java-gcs-overhead-does-it-matter-if-you-have-10mb-or-10gb-of-referenced-objec – JohnPristine Mar 22 '12 at 03:50
  • @JohnPristine - reread the accepted answer to that question. It basically agrees with what I'm saying. – Stephen C Mar 22 '12 at 04:06
  • @chrisapotek - you are making assumptions about the usage of the map. I'm talking about a general purpose Map where you can't make those assumptions. – Stephen C Mar 22 '12 at 04:09
  • @StephenC No! It only matters if you are using the simplest GC of all: mark-and-sweep. All other ones (Generation, Concurrent, G1GC) will take care of that! – JohnPristine Mar 22 '12 at 04:10
  • @StephenC You don't even need to make that assumption. With any GC other than the primitive mark-and-sweep it is hard to come up with a scenario where pooling objects is expensive. – chrisapotek Mar 22 '12 at 04:20
  • Please cite a paper or book that states this. Alternatively, explain precisely how a copying collector **avoids** having to trace and evacuate objects in an object pool. – Stephen C Mar 22 '12 at 04:31
  • (Tonight, I'll attempt to come up with an example that proves my point.) – Stephen C Mar 22 '12 at 04:39
  • @StephenC It is explained here: http://stackoverflow.com/questions/9005199/java-gcs-overhead-does-it-matter-if-you-have-10mb-or-10gb-of-referenced-objec – JohnPristine Mar 22 '12 at 13:35
  • @JohnPristine - no it is not. The objects in your free pool will start life in the new space, and will be traversed and evacuated **at least once** ... the next time that the youngest generation is scavenged. – Stephen C Mar 23 '12 at 11:44
  • @StephenC You are basically saying that any Java program with a lot of memory to manage sucks. Makes no sense. There are alternative GCs to take care of this basic problem as that post explained. – JohnPristine Mar 23 '12 at 13:49
  • 1
    @JohnPristine - No. I'm disputing **your claim** that the linked Q/A says that the GC will (magically) eliminate the traversing / evacuation of objects in a free object pool. And I might add that you are simply repeating that claim over and over without responding to my challenge to explain HOW it does this, or where the linked Q/A SAYS that it does this. (Simply invoking the generational hypothesis is not an explanation ... see my immediately previous comment.) – Stephen C Mar 24 '12 at 00:27
  • 2
    @JohnPristine - And I'm not talking about "any Java program". I'm talking about Java programs that maintain free object pools ... which most experienced Java developers think it a bad idea (except in special cases); e.g. http://programmers.stackexchange.com/questions/115163/is-object-pooling-a-deprecated-technique – Stephen C Mar 24 '12 at 00:44
  • @StephenC I am NOT worried about SPEED here. If that was my concern you would be right. Pool is worse for speed. I am worried about the latency introduced by GC. And that is far away from being neglected. LATENCY is what real-time systems are concerned with. All GCs (maybe not G1GC?) have something called "STOP-THE-WORLD". Now imagine a high-speed loop using a Map and creating garbage on each iteration. I added another question about this: http://stackoverflow.com/questions/9849205/is-there-a-gc-in-java-that-does-not-introduce-latency-stop-the-world-by-perhap – JohnPristine Mar 24 '12 at 04:38

I guess you need a version of HashMap that uses open addressing, and you'll want something better than linear probing. I don't know of a specific recommendation though.
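For reference, open addressing stores entries directly in flat arrays rather than in chained Entry nodes, so once the arrays exist, put and remove allocate nothing. A deliberately minimal sketch using linear probing and tombstone deletion (fixed capacity, and exactly the naive probing the answer suggests improving on; all names are invented):

```java
// Illustrative only: fixed power-of-two capacity, linear probing,
// tombstones for removal. No per-entry object is ever allocated.
class OpenAddressingMap<K, V> {
    private static final Object TOMBSTONE = new Object();
    private final Object[] keys;
    private final Object[] values;
    private final int mask;

    OpenAddressingMap(int capacityPowerOfTwo) {
        keys = new Object[capacityPowerOfTwo];
        values = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    public void put(K key, V value) {
        int i = key.hashCode() & mask;
        int firstFree = -1;  // first tombstone seen, reusable for insertion
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            Object k = keys[i];
            if (k == null) {
                int slot = firstFree >= 0 ? firstFree : i;
                keys[slot] = key;
                values[slot] = value;
                return;
            }
            if (k == TOMBSTONE) {
                if (firstFree < 0) firstFree = i;
            } else if (k.equals(key)) {
                values[i] = value;  // overwrite existing mapping
                return;
            }
        }
        if (firstFree >= 0) { keys[firstFree] = key; values[firstFree] = value; return; }
        throw new IllegalStateException("table full");  // a real impl would resize
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int i = key.hashCode() & mask;
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            Object k = keys[i];
            if (k == null) return null;
            if (k != TOMBSTONE && k.equals(key)) return (V) values[i];
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    public V remove(K key) {
        int i = key.hashCode() & mask;
        for (int probes = 0; probes <= mask; probes++, i = (i + 1) & mask) {
            Object k = keys[i];
            if (k == null) return null;
            if (k != TOMBSTONE && k.equals(key)) {
                V v = (V) values[i];
                keys[i] = TOMBSTONE;  // mark slot dead; no node to release
                values[i] = null;
                return v;
            }
        }
        return null;
    }
}
```

Note the trade-off the answer alludes to: linear probing clusters badly under load, which is why something better (e.g. double hashing or Robin Hood probing) is worth considering.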

Kevin Bourrillion
  • Linear probing will only hurt if you have too many collisions, right? Not sure what you mean by open addressing. I am using a pool of entries as Chris suggested. – JohnPristine Mar 22 '12 at 17:14
  • (I agree with those who say that pooling just to avoid mere instantiation and reclamation of cheap instances is almost always a terrible idea.) – Kevin Bourrillion May 21 '12 at 15:04

http://sourceforge.net/projects/high-scale-lib/ has implementations of Set and Map which do not create garbage on add or remove of keys. The implementation uses a single array with alternating keys and values, so put(k,v) does not create an Entry object.

Now, there are some caveats:

  • Rehash creates garbage because it replaces the underlying array
  • I think this map will rehash given enough interleaved put & delete operations, even if the overall size is stable. (To harvest tombstone values)
  • This map will create Entry objects if you ask for the entry set (one at a time as you iterate)

The class is called NonBlockingHashMap.

Darren Gilroy

One option is to try to fix the HashMap implementation to use a pool of entries. I have done that. :) There are also other optimizations for speed you can do there. I agree with you: that issue with Javolution FastMap is mind-boggling. :(
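If one were to pool entries like this, the recycling part might be sketched as an intrusive free list (names invented here; this is not the actual modified HashMap internals):

```java
// Sketch of an intrusive free list for hash-chain nodes: a removed node is
// chained through its own 'next' field and recycled on the next insert, so
// steady-state add/remove allocates nothing once the pool is warm.
class EntryPool<K, V> {
    static final class Entry<K, V> {
        K key; V value; Entry<K, V> next;
    }

    private Entry<K, V> freeList;   // head of the chain of recycled nodes

    Entry<K, V> acquire(K key, V value, Entry<K, V> next) {
        Entry<K, V> e = freeList;
        if (e == null) {
            e = new Entry<>();      // allocate only when the pool is empty
        } else {
            freeList = e.next;      // pop a recycled node off the free list
        }
        e.key = key; e.value = value; e.next = next;
        return e;
    }

    void release(Entry<K, V> e) {
        e.key = null;               // clear references so values can be collected
        e.value = null;
        e.next = freeList;          // push the node onto the free list
        freeList = e;
    }
}
```

A pooled HashMap variant would call acquire() where stock HashMap does `new Entry<>(...)` and release() on removal; as Stephen C's answer argues, whether this wins depends heavily on keeping the pool small.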

chrisapotek