
There is a disclaimer on ChronicleMap's GitHub page about using ChronicleMap as a multimap:

Chronicle Map is not...

... No secondary indexes.

A multimap. Using a ChronicleMap<K, Collection<V>> as multimap is technically possible, but often leads to problems...

Unfortunately, this is one of my use cases, and using off-heap storage for it (with ChronicleMap) would certainly be the easiest route.

Let me try to explain my problem with pizzas. I have 100,000 different pizzas. Each pizza has an ID and lots of different toppings and crusts. I have three access patterns:

  • Give me the pizza by ID.
  • Give me all the pizzas that have a particular topping.
  • Give me all the pizzas that have a particular crust.

I can store the pizzas easily using a ChronicleMap<UUID,Pizza>. But that's only one access pattern. I don't want to have to iterate through every pizza to find the ones with a matching topping or crust. So, I'd want to store something like ChronicleMap<Topping,Collection<UUID>> and ChronicleMap<Crust,Collection<UUID>>.

Then, if someone asks me for all the pizzas with pepperoni, I look up in the topping ChronicleMap to get the UUIDs of the matching pizzas, then in the main pizza map.
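The two-step lookup can be sketched with ordinary on-heap maps (plain HashMaps standing in for the ChronicleMap instances; the class and method names here are illustrative, not ChronicleMap API):

```java
import java.util.*;

// In-memory sketch of the two-step lookup: an index map from topping to
// pizza IDs, then the main map from ID to pizza. Plain HashMaps stand in
// for the ChronicleMap instances; all names here are illustrative.
public class PizzaIndexSketch {
    static Map<UUID, String> pizzasById = new HashMap<>();           // id -> pizza
    static Map<String, Set<UUID>> pizzasByTopping = new HashMap<>(); // topping -> ids

    static void addPizza(UUID id, String name, List<String> toppings) {
        pizzasById.put(id, name);
        for (String t : toppings) {
            pizzasByTopping.computeIfAbsent(t, k -> new HashSet<>()).add(id);
        }
    }

    // Access pattern 2: all pizzas with a given topping, via the index map.
    static List<String> pizzasWith(String topping) {
        List<String> result = new ArrayList<>();
        for (UUID id : pizzasByTopping.getOrDefault(topping, Collections.emptySet())) {
            result.add(pizzasById.get(id)); // second lookup, in the main map
        }
        return result;
    }
}
```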

But the documentation quoted above scares me. Does anyone know what these "problems" such a thing often leads to might be? Why should I not do this, even though it seems to be working for me? Does it have something to do with how ChronicleMap stores the serialized objects, specifically the collection?

A few additional notes for potential questions:

  1. We may add pizzas later that will require updating the collections, too.
  2. Many processes are trying to do these operations, hence the need for sharing the map via ChronicleMap instead of just a basic ConcurrentMap.
Depressio

1 Answer

If your actual data indeed resembles pizzas, toppings and crusts, i.e. there are only a handful of distinct toppings/crusts, and thousands of pizzas contain each of them, I would say that a proper multimap is overkill for this case. You would be better off with distinct appendable shared lists of UUIDs, such as pepperoni_pizzas.dat, onions_pizzas.dat, ...; you can use Chronicle Queue to access and update them from multiple processes conveniently.

If there are tens to hundreds of thousands of distinct toppings/crusts, and only tens to hundreds of pizzas have a specific topping on average, then you do indeed need a multimap.

Essentially, there are three kinds of "problems" with using Chronicle Map as a multimap:

Excessive garbage allocation on each query

If you create a Chronicle Map with a List<UUID> or Set<UUID> type of value without specifying custom value serializers, it will work, but it will be utterly inefficient, because it will default to built-in Java serialization for serializing and deserializing the whole value collection on each request, without reusing either the collection heap objects or the individual UUID heap objects for the elements. Hence a lot of garbage will be generated on each request to the ChronicleMap.
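To see the cost concretely, here is a self-contained illustration (not ChronicleMap code) of what default Java serialization does to a List<UUID> value: every read materializes a brand-new list and brand-new UUID objects, and the byte stream also carries class metadata on top of the raw payload.

```java
import java.io.*;
import java.util.*;

// Illustration (not ChronicleMap code) of the default-serialization cost:
// each read of a List<UUID> value deserializes a brand-new list and
// brand-new UUID objects, and the stream also carries class metadata.
public class SerializationOverhead {
    public static byte[] serialize(List<UUID> ids) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(new ArrayList<>(ids)); // whole collection written
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @SuppressWarnings("unchecked")
    public static List<UUID> deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (List<UUID>) ois.readObject(); // fresh objects on every call
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```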

Solution: if you specify the value serializer as ListMarshaller or SetMarshaller (or your own custom collection marshaller, which you could write based on the ListMarshaller and SetMarshaller implementations) in conjunction with a reusable UUID heap object, it resolves this garbage problem:

ListMarshaller<ReusableUuid> valueMarshaller = ListMarshaller.of(
    ReusableUuidReader.INSTANCE, ReusableUuidWriter.INSTANCE);
List<ReusableUuid> averageValue = Stream
    .generate(() -> ReusableUuid.random())
    .limit(averagePizzasForTopping)
    .collect(Collectors.toList());
ChronicleMap<Topping, List<ReusableUuid>> map = ChronicleMap
    .of(Topping.class, (Class<List<ReusableUuid>>) (Class) List.class)
    .averageKey(pepperoni)
    .valueMarshaller(valueMarshaller)
    .averageValue(averageValue)
    .entries(numberOfToppings)
    .createPersistedTo(new File("toppings_to_pizza_ids.dat"));

Inefficient value updates and replication

When you append another pizza UUID to a list of 100 UUIDs and insert the new value back into the Chronicle Map, Chronicle Map will re-write the whole list, instead of appending one UUID to the end of the off-heap memory chunk. And if you use replication, it will send the whole list of 100 UUIDs as the updated value to other nodes, instead of sending just the one added UUID.
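The arithmetic behind that re-write cost can be made explicit. This is a back-of-the-envelope sketch, not Chronicle Map internals; it counts only raw UUID payload (16 bytes each) and ignores serialization framing:

```java
// Back-of-the-envelope sketch (not Chronicle Map internals) of the re-write
// cost described above: putting an updated value back replaces the whole
// serialized list, so appending one UUID re-writes the entire payload.
public class RewriteCost {
    public static final int UUID_BYTES = 16; // two longs per UUID

    // Bytes written when the whole list is re-serialized on each append.
    public static int wholeListRewrite(int listSizeBeforeAppend) {
        return UUID_BYTES * (listSizeBeforeAppend + 1);
    }

    // Bytes a hypothetical appendable store (e.g. a per-topping Chronicle
    // Queue, as suggested at the top of this answer) would write instead.
    public static int appendOnly() {
        return UUID_BYTES;
    }
}
```

For the 100-UUID list in the example, that is 1,616 bytes re-written (and replicated) for a 16-byte logical change.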

Both (the value update and the replication) could be optimized via terrible hacks, but that requires very deep knowledge of the Chronicle Map implementation and would be very fragile.

Fragmentation of Chronicle-Map's memory

If you plan to add new pizzas during the data store's lifetime, the memory areas initially allocated for entries will become too small to hold new values with more UUIDs, so the memory areas will be re-allocated (possibly several times for each list of UUIDs). Chronicle Map's data structure design implies a simplified memory allocation scheme which suffers badly from fragmentation if entries are re-allocated many times.

If you have a lot of UUIDs in your lists, and you run your application on Linux, you can mitigate this problem by pre-allocating a lot of memory for each entry (more than will practically be needed by any list) by specifying .actualChunkSize() in the ChronicleMapBuilder configuration, and relying on Linux's lazy, page-by-page allocation of mapped memory. That way you will lose at most 4 KB of memory per UUID list, which may be acceptable if the lists are many KBs in size.

On the other hand, if your lists are that long (and they are lists of UUIDs, i.e. small structures), and you have only 100,000 pizzas in total, you don't need a multimap in the first place; see the beginning of this answer.

The trick with memory overcommitment and lazy mapped memory allocation on Linux also works for short lists (collections) of values, but only if the elements themselves are big, so that the average total value size is many KBs.

Fragmentation is also less of an issue when you can avoid entry memory re-allocation in any other way, i.e. new pizza UUIDs are added over time but removed as well, so the topping-to-UUIDs list sizes float around some average and re-allocation is rarely triggered.

Memory fragmentation is never an issue if values are never updated (or never change in size) after the entry is inserted into the Chronicle Map.

Conclusion

In some use cases and with proper configuration, Chronicle Map can serve well as a multimap. In other cases, Chronicle Map as a multimap is inherently inefficient.

Factors that matter:

  • Total number of key -> List<Value> entries in a multimap
  • Total number of values
  • Average and distribution of key sizes
  • Average and distribution of distinct value sizes
  • Average and distribution of value list sizes
  • Value list dynamics over the Chronicle Map's lifetime (never updated, append-only, or remove and append; removes from the beginning and middle of lists are more expensive)
  • If Chronicle Map is replicated, or not
leventov
    This helps a ton, and is probably the best answer I've ever gotten on Stack Overflow. Wish I could upvote you twice. Thanks a ton! – Depressio Apr 07 '16 at 20:46
  • Question on the `ListMarshaller`. If I'm not using a UUID, but instead just using an Integer (less complex), will I still need to create a read/write marshaller for Integers? Seems like there's already an IntegerMarshaller, so I could probably just do `ListMarshaller.of(IntegerMarshaller.INSTANCE)`. – Depressio Apr 21 '16 at 14:48
  • @Depressio there is the same problem with the Integer class as with UUID: it is immutable, so ListMarshaller will have to create a lot of objects on each deserialization. In the case of integers, it is probably easier and more efficient to copy the whole ListMarshaller class and specialize it to read and write primitive ints instead of generic elements. You could also use a specialized primitive container (e.g. from the fastutil library) for storing on-heap, instead of a generic ArrayList. – leventov Apr 21 '16 at 14:53
  • OK, I'll give that a whirl. I'm a bit confused at the concept of a "reusable" class, though. Since a UUID and Integer are immutable, as you say, wouldn't any class containing something similar have the immutable object inside of it, requiring lots of objects on deserialization anyway? I assumed a "ReusableUUID" would be a class that simply contained a UUID in it, but wouldn't deserializing that containing class also have to re-create the internal, immutable UUID anyway? Seems like any identifier would boil down to some sort of immutable thing (int, String, whatever)... – Depressio Apr 21 '16 at 15:44
  • @Depressio should boil down to primitives -- that are not objects. I. e. MutableUUID should have two fields: long mostSigBits, long leastSigBits (same as original UUID), but those fields re-settable. Could also leave `Strings` as-is, if applying some de-duplication techniques, and String contents are very often the same. – leventov Apr 21 '16 at 15:55
  • @Depressio building hierarchies of such objects might be very cumbersome, but Chronicle Values could help sometimes -- see [example](https://github.com/OpenHFT/Chronicle-Values#another-value-interface) – leventov Apr 21 '16 at 15:56
  • The dots aren't connecting for me, sorry. An `Integer` object is immutable, so deserialization will have to re-create them, right? But won't a `ReusableInteger` object with an `int` primitive inside also have to be re-created when deserialized as well? In that case, I'm not seeing how that would reduce object creation... perhaps there's some magic with the deserialization within Chronicle that I'm missing? Perhaps the second argument that comes into the `read` method of the marshaller will actually have a value if the object is mutable? – Depressio Apr 21 '16 at 16:21
  • @Depressio the idea is to have a cached ArrayList object (or other container) with ReusableInteger objects; on deserialization, rather than the contents of the ArrayList being drained and the list filled with new objects, just the fields of the reusable objects which are the elements of this list are updated. No new objects are created (if all lists in the Chronicle Map are of the same size). If not all lists are of the same size, it might sometimes create new objects and discard existing ones (to adjust the cached list's size to the size of the list currently being deserialized). – leventov Apr 21 '16 at 16:28
  • OK, I think I understand that bit now. Since this is a multi-process thing, I assume the marshaller instance would (or could) contain that cached collection which would get reused on deserialization. As long as it has as many ReusableInteger objects as the max collection size, we ought to be good. We'd just loop through the `Bytes` and each object in this cached collection, setting the internal int of each one and adding it to our return collection for the `read`. This mostly makes sense to me... hopefully I'm not too far off base. – Depressio Apr 21 '16 at 16:43
  • On second thought, that won't work for multi-threaded processes. Each deserialization will retrieve basically the same instance of the ReusableInteger, and set its value to something different. The case could happen where one thread takes an instance, sets it to 500, but then before using it another thread takes the same instance and sets it to 700. This is probably irrelevant to ChronicleMap, though. – Depressio Apr 21 '16 at 17:41
  • @Depressio cache lists should be per-thread. There are as many as two ways to achieve this: 1) The marshaller should implement [StatefulCopyable](https://github.com/OpenHFT/Chronicle-Map#understanding-statefulcopyable) (and in fact, it *must* do this if it has some cache fields in the marshaller class itself, as said in item 7 of the [custom serialization checklist](https://github.com/OpenHFT/Chronicle-Map#custom-serialization-checklist)) – leventov Apr 21 '16 at 17:53
  • @Depressio 2) Marshaller implementation shouldn't actually have cache list field itself, in the first place! Because Chronicle Map caches deserialized keys and values itself! and passes to a marshaller as `using` argument. It wouldn't be correct, if those cached values are returned from map.get() method, but they are returned from Data.get(), i. e. if you use [context blocks](https://github.com/OpenHFT/Chronicle-Map#working-with-an-entry-within-a-context-section) – leventov Apr 21 '16 at 17:57
  • @Depressio That is why [`ListMarshaller`](https://github.com/OpenHFT/Chronicle-Map/blob/a5277d3656ba69a67741b778abf7aa759f9ba3b4/src/main/java/net/openhft/chronicle/hash/serialization/ListMarshaller.java) actually doesn't have a cache field (list); it just re-uses the object passed as the `using` argument. – leventov Apr 21 '16 at 17:58
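Putting the comment thread together, here is a minimal, self-contained sketch of the reuse pattern being described: a mutable holder with the same two long fields as java.util.UUID but re-settable, and a read method in the style of a marshaller's read(in, using) that updates the fields of the objects already in the `using` list instead of allocating new ones. All names here (MutableUuid, read, the long[][] stand-in for off-heap bytes) are illustrative assumptions, not Chronicle Map API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not Chronicle Map API: a mutable UUID holder with
// the same two long fields as java.util.UUID, but re-settable, so one
// instance can be reused across deserializations.
public class MutableUuid {
    public long mostSigBits;
    public long leastSigBits;

    public void set(long mostSigBits, long leastSigBits) {
        this.mostSigBits = mostSigBits;
        this.leastSigBits = leastSigBits;
    }

    // Plays the role of a marshaller's read(in, using) method, with long[][]
    // standing in for the off-heap bytes. Existing holders in `using` are
    // reused; new ones are allocated only if the stored list grew.
    public static List<MutableUuid> read(long[][] data, List<MutableUuid> using) {
        if (using == null) using = new ArrayList<>();
        while (using.size() > data.length) using.remove(using.size() - 1);
        while (using.size() < data.length) using.add(new MutableUuid());
        for (int i = 0; i < data.length; i++) {
            using.get(i).set(data[i][0], data[i][1]); // set fields, no new object
        }
        return using;
    }
}
```

On a steady-state workload where all lists have similar sizes, this allocates nothing per read, which is exactly the garbage-avoidance property discussed above.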