
Currently I am using AtomicLong as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads, my throughput is much lower (75% lower) than single-threaded, for obvious reasons (e.g. concurrent CAS).

Use case: A counter variable which

  • is updated by multiple threads concurrently
  • has high write contention; basically every usage in a thread will consist of a write with an immediate read afterwards (see the sketch after this list)
  • Requirement is that each read from the counter (immediately after the write) gets a unique, incremented value. It is not required that each retrieved counter value is increasing in the same order as the different threads (writers) increment the value.
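
Roughly, each thread today does something like the following (an illustrative sketch of the pattern, not my actual code; class and method names are placeholders):

import java.util.concurrent.atomic.AtomicLong;

public class IdSource
{
    private static final AtomicLong counter = new AtomicLong();

    // Called concurrently from many threads; the "write with an immediate read"
    // is the single atomic compound operation incrementAndGet(). Under contention
    // the CAS inside it retries, which is where the throughput drops.
    static long nextId()
    {
        return counter.incrementAndGet();
    }
}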

So I tried to replace AtomicLong with a LongAdder, and indeed my measurements indicate that my throughput with 8 threads is much better: only about 20% lower than single-threaded (compared to 75% before).

However I'm not sure I correctly understand the way LongAdder works. The JavaDoc says:

This class is usually preferable to AtomicLong when multiple threads update a common sum that is used for purposes such as collecting statistics, not for fine-grained synchronization control.

and for sum()

Returns the current sum. The returned value is NOT an atomic snapshot; invocation in the absence of concurrent updates returns an accurate result, but concurrent updates that occur while the sum is being calculated might not be incorporated.

What is meant by fine-grained synchronization control? From looking at this SO question and the source of AtomicLong and Striped64, I think I understand that if the update on an AtomicLong is blocked because of a CAS instruction issued by another thread, the update is stored thread-locally and accumulated later to get some eventual consistency. So, without further synchronization, and because the incrementAndGet() in LongAdder is not atomic but two instructions, I fear the following is possible:

private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 :  counter.increment();
Thread#2 :  counter.increment();  // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 :  counter.sum();       // == 1
Thread#3 :  counter.increment();  // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 :  counter.sum();       // == 1
Thread#1 :  counter.sum();       // == 3 (after merging everything)

If this is possible, LongAdder is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".

And then with my write/read^n pattern I probably can't do better than AtomicLong?

masch82

2 Answers


LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the Java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.

In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:

the incrementAndGet in LongAdder is not atomic

... but I don't see incrementAndGet at all in LongAdder. Where are you looking?

Your idea of:

  • usage in a thread will consist of a write with an immediate read afterwards
  • Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value. It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.

Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object, really) can never be atomic without some external locking.
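
To make that concrete, here is a minimal sketch (class and method names are mine) contrasting the atomic compound operation with the two-call pattern; only the former can satisfy your uniqueness requirement:

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

class UniqueIdExamples
{
    static final AtomicLong atomic = new AtomicLong();
    static final LongAdder adder = new LongAdder();

    // OK: a single atomic read-modify-write, so every caller gets a distinct value.
    static long uniqueId()
    {
        return atomic.incrementAndGet();
    }

    // Broken for uniqueness: other threads can increment between the two calls,
    // so two threads can observe the same sum. The same problem exists with any
    // two separate calls on an AtomicLong (e.g. an increment followed by a separate get()).
    static long notUniqueId()
    {
        adder.increment();
        return adder.sum();
    }
}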

So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().

Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transitions the object through a specific series of values, and while there is no guarantee about which specific value a given thread will see, all observed values come from the true set of transition values.

For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respectively. The final value will always be 4, and only two transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred, and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3, and vice versa (of course, there is no guarantee that any thread will see a 1 or a 3 at all; they may all see 0 or 4).

LongAdder doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a non-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values - you should never read a "2", for example.

Now that's a bit of a subtle concept and the Javadoc doesn't get it across well; they go with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions), since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongAdder.sum() have pretty much the same guarantees.

Something Useful

OK, so I'll give you something that might be useful. You can still implement what you want efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order in which they were read.

Re-purpose the LongAdder Idea

You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID [2], and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity, and at a large cost to reads.

Now, the way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value, you could use the same code, except with the top or bottom N bits [3] set uniquely per counter.

Now the write can return the prior value, like getAndIncrement, and it will be unique because the fixed bits keep it unique among all the counters in that object.
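
As a rough illustration of that idea (not the actual LongAdder/Striped64 code: plain unpadded AtomicLongs as stripes, the stripe picked from the thread ID, and the stripe index pre-set in the top bits; all names are mine):

import java.util.concurrent.atomic.AtomicLong;

class StripedIdGenerator
{
    private static final int STRIPES = 16;                 // must be a power of two
    private static final int STRIPE_SHIFT = Long.SIZE - 4; // top lg2(STRIPES) bits hold the stripe index
    private final AtomicLong[] stripes = new AtomicLong[STRIPES];

    StripedIdGenerator()
    {
        for (int i = 0; i < STRIPES; i++)
        {
            // Pre-set the fixed top bits so values from different stripes can never collide
            // (unique as long as a single stripe never hands out 2^60 values).
            stripes[i] = new AtomicLong(((long) i) << STRIPE_SHIFT);
        }
    }

    long nextId()
    {
        // Pick a stripe from the thread id, much like LongAdder picks a cell; contention
        // only occurs between threads that happen to map to the same stripe.
        int idx = (int) (Thread.currentThread().getId() & (STRIPES - 1));
        return stripes[idx].incrementAndGet();
    }
}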

Thread-local Counters

A very fast and simple way is to use a unique value per thread, plus a thread-local counter. When the thread-local is initialized, it gets a unique ID from a shared counter (only once per thread), and then you combine that ID with a thread-local counter - for example, the bottom 24 bits for the ID and the top 40 bits for the local counter [1]. This should be very fast and, more importantly, have essentially zero contention.

The downside is that the counter values won't have any specific relationship across threads (although they may still be strictly increasing within a thread). For example, a thread which has only recently started requesting counter values may get a much smaller value than one handed out long ago. You haven't described how you'll use these, so I don't know if that is a problem.

Also, you don't have a single place to read the "total" number of counters allocated - you have to examine all the local counters to do that. This is doable if your application requires it (and has some of the same caveats as the LongAdder.sum() function).
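
A minimal sketch of this thread-local scheme, using the 24/40-bit split mentioned above (names are mine, and it doesn't handle wraparound of the local counter - see footnote [1]):

import java.util.concurrent.atomic.AtomicLong;

class ThreadLocalIdGenerator
{
    private static final AtomicLong NEXT_THREAD_ID = new AtomicLong(); // contended only once per thread

    private static final ThreadLocal<long[]> STATE = ThreadLocal.withInitial(
            // [0] = this thread's unique id (bottom 24 bits), [1] = local counter (top 40 bits)
            () -> new long[] { NEXT_THREAD_ID.getAndIncrement() & 0xFFFFFF, 0 });

    static long nextId()
    {
        long[] s = STATE.get();
        // Combine: top 40 bits are the per-thread counter, bottom 24 bits are the thread id.
        return (s[1]++ << 24) | s[0];
    }
}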

A different solution, if you want the numbers to be "generally increasing with time" across threads and know that every thread requests counter values reasonably frequently, is to use a single global counter from which threads request a local "allocation" of a block of IDs, out of which they then hand out individual IDs in a thread-local manner. For example, threads may request 10 IDs at a time, so that three threads would be allocated the ranges 0-9, 10-19, and 20-29, etc. They then allocate out of that range until it is exhausted, at which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-locally.

The example above will keep the IDs roughly in increasing order over time, and each thread's IDs will be strictly increasing as well. It doesn't offer any strict guarantees, though: a thread that is allocated the range 0-9 could very well sleep for hours after using 0, and then use 1 when the counters on other threads are much higher. But it would reduce contention by a factor of 10.
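
A sketch of that batching scheme with a batch size of 10, as in the example (names are mine; it makes no attempt to reclaim unused ranges or adapt the batch size):

import java.util.concurrent.atomic.AtomicLong;

class BatchedIdGenerator
{
    private static final int BATCH = 10;
    private static final AtomicLong GLOBAL = new AtomicLong(); // touched only once per BATCH ids per thread

    private static final ThreadLocal<long[]> RANGE =
            // [0] = next id to hand out, [1] = first id past the end of this thread's range
            ThreadLocal.withInitial(() -> new long[] { 0, 0 });

    static long nextId()
    {
        long[] r = RANGE.get();
        if (r[0] == r[1])
        {
            // Range exhausted (or never allocated): grab a fresh block of BATCH ids from the global counter.
            r[0] = GLOBAL.getAndAdd(BATCH);
            r[1] = r[0] + BATCH;
        }
        return r[0]++;
    }
}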

There are a variety of other approaches you could use, and most of them trade off contention reduction against the "accuracy" of the counter assignment relative to real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) plus the core ID to get a unique value that is very closely tied to real time (assuming the OS is synchronizing the counters).


[1] The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits to the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).

[2] The optimal is to use the 'CPU ID', but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK) - so the thread ID is used as a proxy.

[3] Where N is lg2(number of counters).

BeeOnRope
  • Thanks - this answers my question. And thanks also for the additional pointers; unfortunately I can't use them because I'm already using similar techniques to construct a 64-bit unique Id including a timestamp (similar to a UUID), but I'm already limited to 16 bits for the counter part and don't want to limit the number space further there with pooling or the like. – masch82 Jul 25 '17 at 05:29
  • One low-contention trick then, which you can apply just for the counter part or for the whole UUID, is "generate and check". For example, let's say you are generating a _probably_ unique value by reading a high-resolution clock, and you want to make it _definitely_ unique. You use the value you read to index into a hash or array-like structure to see if that value was already used (almost never), and if not, mark it as used with a CAS. The usual fast path here is a relatively uncontended CAS on the structure. If the value is already used, you just try the next value. @masch82 – BeeOnRope Jul 25 '17 at 16:28
  • This is effective since you can spread out the contention across as many cache lines as you'd like (you probably only need say 2x the number of CPUs to remove most contention), and the slow path happens only when there really was contention. Of course, there are still various interesting implementation details, such as how you garbage collect the old parts of the hash/array etc, but that's moving into separate question territory. – BeeOnRope Jul 25 '17 at 16:30
  • Finally, if your concern is fully using the 16-bit space, you can still use the batching techniques, but change the algorithm so it uses all the space at the limit, e.g., at some point in the 16-bit cycle (e.g., when you are approaching wraparound), "steal" back any unused values from threads that allocated them but didn't use them. You can also make the batch size adaptive: e.g., first you ask for 1 ID from the global counter, but if you ask again soon then 2, etc. So only active threads get big batches. Combined with stealing you can reduce contention but still use the whole 16-bit space. – BeeOnRope Jul 25 '17 at 16:33

There's a subtle difference between the two implementations.

An AtomicLong holds a single number which every thread will attempt to update. Because of this, as you have already found, only one thread can update this value at a time. The advantage, though, is that the value will always be up-to-date when a get is called, as there will be no adds in progress at that time.

A LongAdder, on the other hand, is made up of multiple values, and each value will be updated by a subset of the threads. This results in less contention when updating the value, however it is possible for sum to have an incomplete value if done while an add is in progress, similar to the scenario you described.

LongAdder is recommended for those cases where you will be doing a bunch of adds in parallel followed by a sum at the end. For your use case, I wrote the following, which confirmed that around 1 in 10 sums were repeated (which renders LongAdder unusable for your use case).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class LongAdderUniquenessTest
{
    public static void main (String[] args) throws Exception
    {
        LongAdder adder = new LongAdder();
        ExecutorService executor = Executors.newFixedThreadPool(10);
        // Tracks how many times each observed counter value was returned.
        Map<Long, Integer> count = new ConcurrentHashMap<>();
        for (int i = 0; i < 10; i++)
        {
            executor.execute(() -> {
                for (int j = 0; j < 1000000; j++)
                {
                    adder.add(1);
                    count.merge(adder.longValue(), 1, Integer::sum);
                }
            });
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
        // Any value observed more than once means two "increment then read" sequences collided.
        count.entrySet().stream().filter(e -> e.getValue() > 1).forEach(System.out::println);
    }
}
Joe C
  • Of course there can be `add`s "in progress" when `get` is called. In fact there can be any number of threads executing the `add` method at the same time (it isn't like the method is synchronized). If you run your same test with `AtomicLong` in place of `LongAdder` and keep the separate read of `longValue()` on the `merge()` line, you get the _same_ duplicates. The problem isn't "concurrent updates" so much as that two separate calls to either class are never going to be atomic. `AtomicLong` differs in that it offers "compound operations" that `LongAdder` does not. – BeeOnRope Jul 23 '17 at 23:16