2

I have a map which should associate Strings with an id. There must not be any gaps between ids, and they must be unique Integers from 0 to N.

A request always comes with two Strings, of which one, both or none may already have been indexed. The map is built in parallel from the ForkJoin pool, and ideally I would like to avoid explicit synchronized blocks. I am looking for an optimal way to maximize throughput, with or without locking.

I don't see how to use AtomicInteger without creating gaps in the sequence for keys which were already present in the map.

public class Foo {
    private final Map<String, Integer> idGenerator = new ConcurrentHashMap<>();

    // invoked from multiple threads
    public void update(String key1, String key2) {
        idGenerator.dosomething(key1, ?) // should save key1 and its unique id
        idGenerator.dosomething(key2, ?) // should save key2 and its unique id
        Bar bar = new Bar(idGenerator.get(key1), idGenerator.get(key2));
        // ... do something with bar
    }
}

I think that the size() method combined with merge() might solve the problem, but I cannot quite convince myself of that. Could anyone suggest an approach for this problem?

EDIT

Regarding the duplicate flag: this cannot be solved with AtomicInteger.incrementAndGet() as suggested in the linked answer. If I did that blindly for every String, there would be gaps in the sequence. What is needed is a compound operation which checks whether the key exists and only then generates an id. I was looking for a way to implement such a compound operation via the Map API.
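To make this concrete, here is roughly the compound operation I mean, written with the explicit lock I am trying to avoid (Bar as in the snippet above; the helper name idFor is only illustrative):

import java.util.HashMap;
import java.util.Map;

public class Foo {
    private final Map<String, Integer> idGenerator = new HashMap<>();
    private int nextId = 0;

    // Check-then-generate as one compound operation, guarded by the monitor.
    // Gap-free and correct, but every caller serializes on this lock.
    private synchronized int idFor(String key) {
        Integer existing = idGenerator.get(key);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        idGenerator.put(key, id);
        return id;
    }

    public void update(String key1, String key2) {
        Bar bar = new Bar(idFor(key1), idFor(key2));
        // ... do something with bar
    }
}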

The second provided answer goes against the requirements I specifically laid out in the question.

John
  • @Andreas I don't see how this is a duplicate. This cannot be solved with a simple incrementAndGet(). Imagine there was already a key in the map, and you invoked incrementAndGet(). I need to first check if a key exists in the hashmap, and only if it doesn't invoke incrementAndGet, or there will be gaps in the sequence – John Sep 22 '18 at 08:19
  • If you want to allow multiple threads to add the same key at the same time for the first time, and it should assign the next globally unique ID to the key, without causing gaps in the ID sequence, then you cannot do it lock-free. If you'll allow a potential gap in the ID sequence when two threads instantiate the same key at the same time, then `AtomicInteger` is the answer for a lock-free implementation. With extended logic, such gaps could be temporary, i.e. the "gap" ID could be saved for use by the next "new" key. – Andreas Sep 22 '18 at 08:33
  • I am thinking that `map.merge(key, map.size(), (old, n) -> old)` could work, but I cannot figure this out from the documentation. This is what I would like to know: could the size function of a concurrent map be used as an id generator in the fashion above? – John Sep 22 '18 at 08:36
  • Read the javadoc of [`ConcurrentHashMap`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html): *"Bear in mind that the results of aggregate status methods including **`size`**, `isEmpty`, and `containsValue` are typically useful only when a map is not undergoing concurrent updates in other threads. Otherwise the results of these methods reflect transient states that may be adequate for monitoring or estimation purposes, but not for program control"*. This means that `size()` is not an atomic operation, and hence not an accurate value. – Andreas Sep 22 '18 at 08:44
  • Also, `ConcurrentHashMap` is **not lock-free**. Only *retrieval* operations are lock-free. *Update* operations may block, especially if the internal hashtable needs to be expanded. – Andreas Sep 22 '18 at 08:50
  • @John: I'm not sure you can do exactly what you want. You can batch some updates, though. Check for hits in the hashmap, and make a unique collection of strings that don't yet have a sequential ID (maybe in another ConcurrentHashMap or set?). Either work in batches (pausing updates to the to-be-added set), or have another single thread taking elements one at a time from the to-be-added set and adding them to the main ConcurrentHashMap, with that one thread doing the numbering. So you have parallelism for dup-finding, but single-threaded sequencing/addition. – Peter Cordes Sep 22 '18 at 08:55
  • @Andreas: I'm not sure if what I suggested in my prev comment would be efficient in practice. Can you batch updates by having one thread take a lock that lets it efficiently add multiple elements without unlocking / re-locking? Either way, you avoid atomic increment because it's only one thread doing the numbering of the not-already-added strings. Seems like some kind of read-copy-update (RCU) might be good to let readers keep filtering out duplicates while constructing a new table with a batch of insertions done. – Peter Cordes Sep 22 '18 at 08:59
  • @PeterCordes thank you for the suggestion. It seems that locking is required, so I am looking for an optimal way to achieve maximal throughput. Perhaps it is important to note that I expect many duplicates. – John Sep 22 '18 at 09:01
  • @John: Is high but bursty throughput ok, or do you require minimal latency between receiving a new string and having it show up in the main hashmap? Do you expect most new strings to already have numbers? If not, the extra copying around and contending for access to concurrent data structures is maybe not worth it. – Peter Cordes Sep 22 '18 at 09:04
  • @PeterCordes latency is not that important. This application should chew through a large amount of data and eventually produce output. Most of the time there should be a search hit in the map. – John Sep 22 '18 at 09:07
  • If you're expecting mostly hits, then letting multiple threads keep checking new keys in parallel as one thread enumerates the non-duplicates should be good. – Peter Cordes Sep 22 '18 at 09:09
  • The example `update` method uses the generated IDs right away, i.e., it processes the request immediately and the IDs are used in that process - but in the comments it seems you are leaving the door open to potentially batching many `update` calls. Are the `update` calls coming from an external source at unknown times, or are they the result of consuming some existing data in-process? There is a tension here between processing each "request" immediately and batching things for more efficiency. If it's just a bulk processing operation, the ideal solution will be very different. – BeeOnRope Sep 22 '18 at 17:37

2 Answers

4

There is no way to do it exactly the way you want it: ConcurrentHashMap is not, in and of itself, lock-free. However, you can do it atomically, without any explicit lock management, by using ConcurrentHashMap's atomic override of the java.util.Map.computeIfAbsent function.

Here's a code sample in the style of what you provided that should get you going.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class Foo {
    private final ConcurrentHashMap<String, Integer> keyMap = new ConcurrentHashMap<>();
    private final AtomicInteger sequence = new AtomicInteger();

    // invoked from multiple threads
    public void update(String key1, String key2) {
        // computeIfAbsent on ConcurrentHashMap is atomic: the mapping function
        // runs at most once per key, so no sequence value is ever skipped.
        Integer id1 = keyMap.computeIfAbsent(key1, s -> sequence.getAndIncrement());
        Integer id2 = keyMap.computeIfAbsent(key2, s -> sequence.getAndIncrement());

        Bar bar = new Bar(id1, id2);
        // ... do something with bar
    }
}
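If you want to convince yourself that the IDs stay gap-free under contention, a quick throwaway harness along these lines should work (illustrative only; the Demo class, key count and workload are made up, and parallel streams run on the common ForkJoin pool):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class Demo {
    static final ConcurrentHashMap<String, Integer> keyMap = new ConcurrentHashMap<>();
    static final AtomicInteger sequence = new AtomicInteger();

    static int idFor(String key) {
        return keyMap.computeIfAbsent(key, s -> sequence.getAndIncrement());
    }

    public static void main(String[] args) {
        // Many overlapping "requests" with lots of duplicate keys.
        IntStream.range(0, 1_000_000).parallel()
                 .forEach(i -> idFor("key-" + (i % 10_000)));

        // Every id 0..N-1 should be present exactly once: no gaps, no duplicates.
        long distinctIds = keyMap.values().stream().distinct().count();
        int maxId = keyMap.values().stream().max(Integer::compare).orElse(-1);
        System.out.println("keys=" + keyMap.size()
                + " distinct ids=" + distinctIds
                + " max id=" + maxId);
    }
}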
lscoughlin
  • What if 2 threads get the same `key1` at the same time, and both increment the sequence number *for the same key*? Does `computeIfAbsent` stop that from happening by locking the whole map or something? That would probably be less efficient than only having a single thread building the hashmap, unless *most* keys are already present. – Peter Cordes Sep 22 '18 at 09:07
  • It seems that this should be correct, maybe not perfectly optimal, but it should work correctly. That is what I can figure out from the implementation and documentation. Docs: *"The entire method invocation is performed atomically, so the function is applied at most once per key. Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this map."* – John Sep 22 '18 at 09:20
  • The implementation of computeIfAbsent is atomic for ConcurrentHashMap, so following threads will block -- there's really nothing you can do about that. It does not block the whole map, and does not block on a read. All of that you would know if you read the linked javadoc, which you should do. – lscoughlin Sep 22 '18 at 09:21
  • [Docs for `ConcurrentMap.computeIfAbsent` say](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentMap.html#computeIfAbsent-K-java.util.function.Function-): *The default implementation may retry these steps when multiple threads attempt updates **including potentially calling the mapping function multiple times**.* So we have a correctness problem @john, unless I'm missing something :( (I'm here for the lock-free tag, not Java specifically. I do mainly C/C++ and x86 asm.) – Peter Cordes Sep 22 '18 at 09:31
  • @PeterCordes that is from the interface specification of ConcurrentMap. This is the specification for the concrete ConcurrentHashMap implementation: https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ConcurrentHashMap.html#computeIfAbsent-K-java.util.function.Function, which should not behave like the default implementation. – John Sep 22 '18 at 09:44
  • @John: ah thanks. I'm not sure if this will perform better or worse than my answer. It might depend on the hardware (like few vs. many threads, as well as inter-core latency), and various implementation details. Or this might be strictly better than anything I suggested, other than maybe RCU. Or maybe not, if adding things to a lockless queue blocks the readers less than waiting to acquire a lock and update the main HashMap. (See update to my answer: multiple queues could reduce contention further.) – Peter Cordes Sep 22 '18 at 09:51
  • @PeterCordes AtomicInteger does a CAS swap and it's optimised down to the hardware, so you can't really get better than that. There are implementation details of ConcurrentHashMap that can be fiddled with to get the best possible response out of it, but making it truly tight would require a custom map implementation that did, like, a reentrant lock at the Node level or something. All of the information you put in your answer was spot on, it's just that Java has a nice built-in idiom for handling it. – lscoughlin Sep 22 '18 at 12:20
  • @lscoughlin: Why would it use a CAS on x86, instead of `lock xadd`? A `lock cmpxchg` retry loop would be silly. But anyway, it's still a full memory barrier (x86 doesn't have relaxed atomic RMW), and on any ISA it has to contend with all other threads that are trying to increment the same location. This serializes enumeration, and (if inserts are at all common) could add significant contention between threads. My suggestion has a lot more copying, but having a single thread enumerate avoids all the contention and overhead because it can be a local counter. – Peter Cordes Sep 22 '18 at 12:47
  • Uncontended enqueue into a lockless queue might also take an atomic RMW operation, unless every thread has its own single-producer/single-consumer queue, so we still get the full memory barrier for every non-hit in the main hash, but the difference is that they can be uncontended or much lower contention atomic ops, if there's a separate queue for every 1, or group of 2 or 4, threads. Most of the cost of an atomic RMW is waiting for exclusive access to the cache line. If the nice idiom in your answer is performant enough, then that's great. If not, there are quite likely some perf options. – Peter Cordes Sep 22 '18 at 12:51
  • @PeterCordes it probably does if you let the JIT warm it up enough. Part of letting the VM do that work for you is... letting the VM do that work for you. I don't really get the impression that he's building a micro-latency trading platform (in which case I wouldn't use any of the "standard" idioms), so it's more of a "what is your time worth" rather than a "what is the best way" question. I'm impressed by your thought process though, for realz. – lscoughlin Sep 22 '18 at 13:01
  • Mutations of the concurrent structures in Java all definitely take at least one (hopefully just one) atomic operation. I think the solution with `computeIfAbsent` is close to ideal, at least if you want immediate response (i.e., you use the IDs right away as the example shows) and you want the IDs to be "gap free" at every point in time (roughly speaking: here I'm trying to exclude solutions that allocate blocks of IDs, perhaps thread-locally, and end up gap-free at the end but not at intermediate points). – BeeOnRope Sep 22 '18 at 17:29
  • `computeIfAbsent` is basically using the existing mutation mechanism to allow a user-function to be executed in a "locked" manner, and AFAIR that exclusion was already present in the basic add if absent path, so it's somehow "free" if you were set on using CHM anyways. The "locking" here is fine-grained though, I think per-partition in Java 7 and earlier, and per-element in Java 8. So this seems reasonable for the mutate path, and CHM gives a fast read-path which should be good for the common case. Interesting that CM provides weaker guarantees than CHM for `cIfA` - leaving options open... – BeeOnRope Sep 22 '18 at 17:33
  • From the [JDK source code](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/concurrent/ConcurrentHashMap.java#l1643) you could tell `computeIfAbsent` is only synchronized for small sets of compound operations. – Downhillski Sep 23 '18 at 03:09
3

I'm not sure you can do exactly what you want. You can batch some updates, though, or do the checking separately from the enumerating / adding.

A lot of this answer assumes that order isn't important: you need all the strings to be given a number, but reordering, even within a pair, is ok, right? Concurrency could already cause reordering of pairs, or cause members of a pair not to get contiguous numbers, but reordering could also lead to the first of a pair getting a higher number.

> latency is not that important. This application should chew through a large amount of data and eventually produce output. Most of the time there should be a search hit in the map

If most searches hit, then we mostly need read throughput on the map.

A single writer thread might be sufficient.

So instead of adding directly to the main map, concurrent readers can check their inputs and, if not present, add them to a queue to be enumerated and added to the main ConcurrentHashMap. The queue could be a simple lockless queue, or it could be another ConcurrentHashMap used to also filter duplicates out of the not-yet-added candidates. But probably a lockless queue is good.

Then you don't need an atomic counter, and you don't have any problems with 2 threads incrementing the counter twice when they see the same string before either of them can add it to the map. (Because otherwise that would be a big problem.)
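I don't write Java every day, so take the following as a rough sketch of the shape I mean rather than a tuned implementation: the class and method names are invented, ID assignment is deferred until the single enumerating thread drains the queue, and ConcurrentLinkedQueue stands in for whichever lockless queue you prefer.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Readers only ever read 'ids'; misses go into a queue that a single
// enumerating thread drains, so the counter can be a plain int with no
// atomic RMW and no contention on it.
public class BatchedIdAssigner {
    private final ConcurrentHashMap<String, Integer> ids = new ConcurrentHashMap<>();
    private final ConcurrentLinkedQueue<String> pending = new ConcurrentLinkedQueue<>();
    private int nextId = 0;   // touched only by the single enumerating thread

    // Called from many reader threads.
    public void submit(String key) {
        if (!ids.containsKey(key)) {
            pending.add(key);             // duplicates may slip in; filtered below
        }
    }

    // Called only from the single enumerating/writer thread.
    public void drain() {
        String key;
        while ((key = pending.poll()) != null) {
            if (ids.putIfAbsent(key, nextId) == null) {
                nextId++;                 // count only keys that were actually new
            }
        }
    }
}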

If there's a way for a writer to lock the ConcurrentHashMap to make a batch of updates more efficient, that could be good. But if the hit rate is expected to be quite high, you really want other reader threads to keep filtering duplicates as much as possible while we're growing it instead of pausing that.


To reduce contention between the main front-end threads, you could have multiple queues, like maybe each thread has a single-producer / single-consumer queue, or a group of 4 threads running on a pair of physical cores shares one queue.

The enumerating thread reads from all of them.

In a queue where readers don't contend with writers, the enumerating thread has no contention. But multiple queues reduce contention between writers. (The threads writing these queues are the threads that access the main ConcurrentHashMap read-only, where most CPU time will be spent if hit-rates are high.)
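Standard Java doesn't ship a true single-producer / single-consumer queue, so as a stand-in here is a sketch with one ConcurrentLinkedQueue per group of producer threads (names invented); the point is only that producers mostly contend within their own small group while the enumerating thread polls every shard.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// One queue per group of producer threads to spread out contention; the
// single enumerating thread polls every shard in turn.
public class ShardedPending {
    private final List<ConcurrentLinkedQueue<String>> shards;

    public ShardedPending(int shardCount) {
        shards = new ArrayList<>(shardCount);
        for (int i = 0; i < shardCount; i++) {
            shards.add(new ConcurrentLinkedQueue<>());
        }
    }

    // Producer side: pick a shard from the current thread, e.g. by thread id.
    public void add(String key) {
        int shard = (int) (Thread.currentThread().getId() % shards.size());
        shards.get(shard).add(key);
    }

    // Consumer side (single enumerating thread): drain every shard.
    public void drainTo(Consumer<String> sink) {
        for (ConcurrentLinkedQueue<String> queue : shards) {
            String key;
            while ((key = queue.poll()) != null) {
                sink.accept(key);
            }
        }
    }
}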


Some kind of read-copy-update (RCU) data structure might be good, if Java has that. It would let readers keep filtering out duplicates at full speed, while the enumerating thread constructs a new table with a batch of insertions done, with zero contention while it's building the new table.
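The closest cheap approximation I know of in plain Java is hand-rolled copy-on-write: publish an immutable snapshot through an AtomicReference. Again just a sketch with invented names, and the whole-map copy on each batch is the price paid for contention-free reads.

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// RCU-ish: readers always see a fully built map that is never mutated again;
// the single writer copies it, folds in a whole batch, then publishes the
// new version with one reference swap.
public class SnapshotIds {
    private final AtomicReference<Map<String, Integer>> snapshot =
            new AtomicReference<>(new HashMap<>());

    // Reader fast path: no locks, no contention, just a volatile read.
    public Integer lookup(String key) {
        return snapshot.get().get(key);
    }

    // Single writer only: fold a batch of (possibly duplicate) new keys in.
    public void publishBatch(Collection<String> batch) {
        Map<String, Integer> next = new HashMap<>(snapshot.get());  // copy
        int id = next.size();   // existing ids are 0..size-1, so this continues gap-free
        for (String key : batch) {
            if (!next.containsKey(key)) {
                next.put(key, id++);
            }
        }
        snapshot.set(next);     // update: publish the new snapshot
    }
}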


With a 90% hit rate, one writer thread could maybe keep up with 10 or so reader threads that filter new keys against the main table.

You might want to set some queue-size limit to allow for back-pressure from the single writer thread. Or, if you have many more cores / threads than a single writer can keep up with, then maybe some kind of concurrent set to let the multiple threads eliminate duplicates before numbering would help.

Or really, if you can just wait until the end to number everything, that would be a lot simpler, I think.

I thought about maybe trying to number with room for error on race conditions, and then going back to fix things up, but that probably isn't better.

Peter Cordes
  • The exact use case isn't totally clear to me, but if I understand what you are saying, I don't think this can implement the `update` function in the way the OP suggests it is used: the IDs for the strings (either newly created or existing) are used immediately after. The OP says he doesn't care about latency, but he does indicate these are "requests" and the sample code shows the new IDs being used "synchronously", so it isn't clear that you can defer the ID generation until later. That said, I think the accepted CHM answer is close to ideal. Reads in CMHs are very fast ... – BeeOnRope Sep 22 '18 at 17:18
  • ... the only faster approach I could think of would be a read-only simple `HashMap` (or whatever custom single-threaded map implementation you come up with which is faster) for the initial search, and then the fallback path on a miss would use a shared mutating concurrent map for "recently added" elements, or whatever other strategy - all in the name of speeding up the fast path if hits are very common. Of course with periodic updates of the shared read-only map. Of course that's basically what you're suggesting in the last part, although I think the fallback path could be simpler than that. – BeeOnRope Sep 22 '18 at 17:23