96

How much slower is reading from a ThreadLocal variable than from a regular field?

More concretely: is simple object creation faster or slower than access to a ThreadLocal variable?

I assume that it is fast enough so that keeping a ThreadLocal<MessageDigest> instance is much faster than creating an instance of MessageDigest every time. But does that also apply to byte[10] or byte[1000], for example?
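
For concreteness, the kind of caching I have in mind is roughly this sketch (it uses the Java 8 ThreadLocal.withInitial factory and SHA-1 purely as an example):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Hashing {
    // One MessageDigest per thread, created lazily on the first get() in that thread
    private static final ThreadLocal<MessageDigest> DIGEST =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("SHA-1");
                } catch (NoSuchAlgorithmException e) {
                    throw new AssertionError(e);
                }
            });

    public static byte[] hash(byte[] data) {
        // digest() completes the hash and resets the instance for the next call
        return DIGEST.get().digest(data);
    }
}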

Edit: The question is what really goes on when calling ThreadLocal's get(). If it is just a field, like any other, then the answer would be "it's always fastest", right?

Nick Bastin
Sarmun
  • A thread local is basically a field containing a hashmap and a lookup where the key is the current thread object. It is therefore much slower, but still fast. :) – eckes Jun 27 '15 at 18:44
  • @eckes: it certainly behaves like that, but it's not usually implemented this way. Instead, `Thread`s contain an (unsynchronized) hashmap where the key is the current `ThreadLocal` object. – sbk Nov 29 '16 at 09:04

6 Answers

63

In 2009, some JVMs implemented ThreadLocal using an unsynchronised HashMap in the Thread.currentThread() object. This made it extremely fast (though not nearly as fast as using a regular field access, of course), as well as ensuring that the ThreadLocal object got tidied up when the Thread died. Updating this answer in 2016, it seems most (all?) newer JVMs use a ThreadLocalMap with linear probing. I am uncertain about the performance of those – but I cannot imagine it is significantly worse than the earlier implementation.
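
Conceptually, a get() boils down to something like the sketch below. This is not the real JDK code (the real ThreadLocalMap lives inside the Thread object and uses linear probing, and the ConcurrentHashMap here is only a stand-in for "each thread finds its own map"), but it shows why the lookup is cheap for the owning thread:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SketchThreadLocal<T> {
    // Stand-in for the per-thread map the JDK stores directly in the Thread object.
    private static final ConcurrentHashMap<Thread, Map<SketchThreadLocal<?>, Object>> PER_THREAD =
            new ConcurrentHashMap<>();

    protected T initialValue() { return null; }

    @SuppressWarnings("unchecked")
    public T get() {
        // Find the map belonging to the current thread...
        Map<SketchThreadLocal<?>, Object> map =
                PER_THREAD.computeIfAbsent(Thread.currentThread(), t -> new HashMap<>());
        // ...then do an unsynchronised lookup keyed by this ThreadLocal instance.
        if (!map.containsKey(this)) {
            map.put(this, initialValue()); // created lazily, once per thread
        }
        return (T) map.get(this);
    }
}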

Of course, new Object() is also very fast these days, and the garbage collectors are also very good at reclaiming short-lived objects.

Unless you are certain that object creation is going to be expensive, or you need to persist some state on a thread-by-thread basis, you are better off going for the simpler allocate-when-needed solution, and only switching over to a ThreadLocal implementation when a profiler tells you that you need to.

Lii
Bill Michell
  • +1 for being the only answer to actually address the question. – cletus Mar 04 '09 at 11:42
  • Can you give me an example of a modern JVM that doesn't use linear probing for ThreadLocalMap? Java 8 OpenJDK still seems to be using ThreadLocalMap with linear probing. http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/lang/ThreadLocal.java#297 – Karthick Oct 23 '16 at 20:23
  • @Karthick Sorry no I can't. I wrote this back in 2009. I will update. – Bill Michell Oct 31 '16 at 09:20
44

Running unpublished benchmarks, ThreadLocal.get takes around 35 cycles per iteration on my machine. Not a great deal. In Sun's implementation, a custom linear-probing hash map in Thread maps ThreadLocals to values. Because it is only ever accessed by a single thread, it can be very fast.

Allocation of small objects takes a similar number of cycles, although because of cache exhaustion you may get somewhat lower figures in a tight loop.

Construction of a MessageDigest is likely to be relatively expensive. It has a fair amount of state, and construction goes through the Provider SPI mechanism. You may be able to optimise by, for instance, cloning a prototype instance or by supplying the Provider directly.
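
For instance, the cloning idea could look roughly like this (a sketch; MessageDigest.clone() only works when the provider's implementation is Cloneable, so a fallback is needed):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class DigestFactory {
    // Pay the Provider/SPI lookup cost once, then copy the prototype's state.
    private static final MessageDigest PROTOTYPE = newFromProvider();

    private static MessageDigest newFromProvider() {
        try {
            return MessageDigest.getInstance("SHA-1"); // algorithm chosen arbitrarily for the sketch
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }

    static MessageDigest create() {
        try {
            return (MessageDigest) PROTOTYPE.clone();  // cheap copy that skips the SPI lookup
        } catch (CloneNotSupportedException e) {
            return newFromProvider();                  // this provider's digest is not Cloneable
        }
    }
}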

Just because it may be faster to cache in a ThreadLocal rather than create does not necessarily mean that the system performance will increase. You will have additional overheads related to GC which slow everything down.

Unless your application uses MessageDigest very heavily, you might want to consider using a conventional thread-safe cache instead.
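
One shape such a cache could take is a simple shared pool (a sketch only; the pool here is unbounded and the algorithm is arbitrary):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentLinkedQueue;

final class DigestPool {
    private final ConcurrentLinkedQueue<MessageDigest> idle = new ConcurrentLinkedQueue<>();

    byte[] digest(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = idle.poll();               // reuse an idle instance if one is available
        if (md == null) {
            md = MessageDigest.getInstance("SHA-1");  // otherwise pay the construction cost
        }
        try {
            return md.digest(data);                   // digest() also resets the instance
        } finally {
            idle.offer(md);                           // hand it back for any thread to reuse
        }
    }
}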

Tom Hawtin - tackline
  • IMHO, the fastest way is just to ignore the SPI and use something like `new org.bouncycastle.crypto.digests.SHA1Digest()`. I'm quite sure no cache can beat it. – maaartinus Mar 08 '11 at 01:40
  • I suppose a thread local has to deal with TLS memory access anyway, which means a CPU cache reset, so it has a dramatic influence on overall machine performance – azis.mrazish Jun 08 '22 at 09:37
37

Good question; I've been asking myself that recently. To give you definite numbers, here are the benchmarks I used, in Scala, compiled to virtually the same bytecode as the equivalent Java code:

var cnt: String = ""
val tlocal = new java.lang.ThreadLocal[String] {
  override def initialValue = ""
}

// reads the regular field `cnt` on every iteration (the write of "!" never actually happens)
def loop_heap_write = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (cnt ne "") cnt = "!"
    i += 1
  }
  cnt
}

// reads the thread-local value on every iteration
def threadlocal = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (tlocal.get eq null) i = until + i + 1
    i += 1
  }
  if (i > until) println("thread local value was null " + i)
}

The benchmarks (available here) were performed on an AMD machine with 4 dual-core 2.8 GHz processors and on a quad-core i7 with hyperthreading (2.67 GHz).

These are the numbers:

i7

Specs: Intel i7, 2x quad-core @ 2.67 GHz. Test: scala.threads.ParallelTests, 200 tests per configuration.

| Test name      | Threads | Run times (last 5)                      | avg     | min     | max      |
|----------------|---------|-----------------------------------------|---------|---------|----------|
| loop_heap_read | 1       | 9.0069 9.0036 9.0017 9.0084 9.0074      | 9.1034  | 8.9986  | 21.0306  |
| loop_heap_read | 2       | 4.5563 4.7128 4.5663 4.5617 4.5724      | 4.6337  | 4.5509  | 13.9476  |
| loop_heap_read | 4       | 2.3946 2.3979 2.3934 2.3937 2.3964      | 2.5113  | 2.3884  | 13.5496  |
| loop_heap_read | 8       | 2.4479 2.4362 2.4323 2.4472 2.4383      | 2.5562  | 2.4166  | 10.3726  |
| threadlocal    | 1       | 91.1741 90.8978 90.6181 90.6200 90.6113 | 91.0291 | 90.6000 | 129.7501 |
| threadlocal    | 2       | 45.3838 45.3858 45.6676 45.3772 45.3839 | 46.0555 | 45.3726 | 90.7108  |
| threadlocal    | 4       | 22.8118 22.8135 59.1753 22.8229 22.8172 | 23.9752 | 22.7951 | 59.1753  |
| threadlocal    | 8       | 22.2965 22.2415 22.3438 22.3109 22.4460 | 23.2676 | 22.2346 | 50.3583  |

AMD

Specs: AMD 8220, 4x dual-core @ 2.8 GHz. Test: scala.threads.ParallelTests, total work: 20000000, 200 tests per configuration.

| Test name      | Threads | Run times (last 5)                      | avg      | min     | max     |
|----------------|---------|-----------------------------------------|----------|---------|---------|
| loop_heap_read | 1       | 12.625 12.631 12.634 12.632 12.628      | 12.7333  | 12.619  | 26.698  |
| loop_heap_read | 2       | 6.412 6.424 6.408 6.397 6.43            | 6.5367   | 6.393   | 19.716  |
| loop_heap_read | 4       | 3.385 4.298 9.7 6.535 3.385             | 5.6079   | 3.354   | 21.603  |
| loop_heap_read | 8       | 5.389 5.795 10.818 3.823 3.824          | 5.5810   | 2.405   | 19.755  |
| threadlocal    | 1       | 200.217 207.335 200.241 207.342 200.23  | 202.2424 | 200.184 | 245.369 |
| threadlocal    | 2       | 100.208 100.199 100.211 103.781 100.215 | 102.2238 | 100.192 | 129.505 |
| threadlocal    | 4       | 62.101 67.629 62.087 52.021 55.766      | 65.6361  | 50.282  | 167.433 |
| threadlocal    | 8       | 40.672 74.301 34.434 41.549 28.119      | 54.7701  | 28.119  | 94.424  |

Summary

A thread-local read takes around 10-20x as long as a heap read. It also seems to scale well with the number of processors on this JVM implementation and on these architectures.

axel22
  • +1 Kudos on being the only one to give quantitative results. I'm a bit skeptical because these tests are in Scala, but like you said, the Java bytecodes should be similar... – Gravity Sep 01 '11 at 22:39
  • Thanks! This while loop results in virtually the same bytecode as the corresponding Java code would produce. Different times could be observed on different VMs, though - this has been tested on a Sun JVM 1.6. – axel22 Sep 01 '11 at 22:46
  • This benchmark code does not simulate a good use case for ThreadLocal. In the first method, every thread will have a shared representation in memory; the string does not change. In the second method you benchmark the cost of a hashtable lookup where the string is disjoint between all threads. – Joelmob Mar 24 '17 at 10:31
  • The string does not change, but it's read from memory (the write of `"!"` never occurs) in the first method - the first method is effectively equivalent to subclassing `Thread` and giving it a custom field. The benchmark measures an extreme edge case where the entire computation consists of reading a variable/thread local - real applications may not be affected depending on their access pattern, but in the worst case, they will behave as above. – axel22 Mar 25 '17 at 22:21
3

Here is another test. The results show that ThreadLocal is a bit slower than a regular field, but of the same order: approximately 12% slower.

import java.util.HashMap;
import java.util.Map;

public class Test {
    private static final int N = 100000000;
    private static int fieldExecTime = 0;
    private static int threadLocalExecTime = 0;

    public static void main(String[] args) throws InterruptedException {
        int execs = 10;
        for (int i = 0; i < execs; i++) {
            new FieldExample().run(i);
            new ThreadLocalExample().run(i);
        }
        System.out.println("Field avg:" + (fieldExecTime / execs));
        System.out.println("ThreadLocal avg:" + (threadLocalExecTime / execs));
    }

    // Accesses the map through a regular instance field.
    private static class FieldExample {
        private Map<String, String> map = new HashMap<String, String>();

        public void run(int z) {
            System.out.println(z + "-Running field sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                map.put(s, "a");
                map.remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            fieldExecTime += t;
            System.out.println(z + "-End field sample:" + t);
        }
    }

    // Accesses the same kind of map, but looks it up through a ThreadLocal on every operation.
    private static class ThreadLocalExample {
        private ThreadLocal<Map<String, String>> myThreadLocal = new ThreadLocal<Map<String, String>>() {
            @Override protected Map<String, String> initialValue() {
                return new HashMap<String, String>();
            }
        };

        public void run(int z) {
            System.out.println(z + "-Running thread local sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++) {
                String s = Integer.toString(i);
                myThreadLocal.get().put(s, "a");
                myThreadLocal.get().remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            threadLocalExecTime += t;
            System.out.println(z + "-End thread local sample:" + t);
        }
    }
}

Output:

0-Running field sample
0-End field sample:6044
0-Running thread local sample
0-End thread local sample:6015
1-Running field sample
1-End field sample:5095
1-Running thread local sample
1-End thread local sample:5720
2-Running field sample
2-End field sample:4842
2-Running thread local sample
2-End thread local sample:5835
3-Running field sample
3-End field sample:4674
3-Running thread local sample
3-End thread local sample:5287
4-Running field sample
4-End field sample:4849
4-Running thread local sample
4-End thread local sample:5309
5-Running field sample
5-End field sample:4781
5-Running thread local sample
5-End thread local sample:5330
6-Running field sample
6-End field sample:5294
6-Running thread local sample
6-End thread local sample:5511
7-Running field sample
7-End field sample:5119
7-Running thread local sample
7-End thread local sample:5793
8-Running field sample
8-End field sample:4977
8-Running thread local sample
8-End thread local sample:6374
9-Running field sample
9-End field sample:4841
9-Running thread local sample
9-End thread local sample:5471

Field avg:5051
ThreadLocal avg:5664

Env:
openjdk version "1.8.0_131"
Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
Ubuntu 16.04 LTS

jpereira
  • Sorry, this isn't even close to being a valid test. A) Biggest issue: you're allocating Strings with every iteration (`Integer.toString`), which is extremely expensive compared to what you're testing. B) You're doing two map ops every iteration, also totally unrelated and expensive. Try incrementing a primitive int from a ThreadLocal instead. C) Use `System.nanoTime` instead of `System.currentTimeMillis`; the former is for profiling, the latter is for _user_ date-time purposes and can change under your feet. D) You should avoid allocs entirely, including the top-level ones for your "example" classes. – Philip Guin Mar 20 '20 at 19:35
3

@Pete is correct: test before you optimise.

I would be very surprised if constructing a MessageDigest has any serious overhead compared to actually using it.

Misusing ThreadLocal can be a source of leaks and dangling references that don't have a clear life cycle. Generally, I never use ThreadLocal without a very clear plan of when a particular resource will be removed.

Gareth Davis
0

Build it and measure it.

Also, you only need one ThreadLocal if you encapsulate your message-digesting behaviour into an object. If you need a local MessageDigest and a local byte[1000] for some purpose, create an object with a MessageDigest field and a byte[] field and put that object into the ThreadLocal, rather than both individually.
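
A sketch of that approach (the names and the SHA-1 choice are illustrative only):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Bundle all per-thread scratch state behind a single ThreadLocal lookup.
final class DigestScratch {
    final MessageDigest digest;
    final byte[] buffer = new byte[1000];

    DigestScratch() {
        try {
            digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
    }
}

class Hasher {
    private static final ThreadLocal<DigestScratch> SCRATCH =
            ThreadLocal.withInitial(DigestScratch::new);

    byte[] hash(byte[] input) {
        DigestScratch s = SCRATCH.get(); // one lookup yields both the digest and the scratch buffer
        return s.digest.digest(input);
    }
}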

Pete Kirkham