Java thread creation overhead

Question

Conventional wisdom tells us that high-volume enterprise Java applications should use thread pooling in preference to spawning new worker threads. The use of java.util.concurrent makes this straightforward.

There do exist situations, however, where thread pooling is not a good fit. The specific example which I am currently wrestling with is the use of InheritableThreadLocal, which allows ThreadLocal variables to be "passed down" to any spawned threads. This mechanism breaks when using thread pools, since the worker threads are generally not spawned from the request thread, but are pre-existing.
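
To make the failure mode concrete, here is a minimal sketch (Java 8 or later). The class `InheritanceDemo`, the variable `CONTEXT` and the value "request-42" are made up for illustration, and the sketch assumes a single-thread pool whose worker was already started before the request set its value, which is the usual situation in a long-running server.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of any library.
final class InheritanceDemo {
    static final InheritableThreadLocal<String> CONTEXT = new InheritableThreadLocal<>();

    /** Returns { value seen by a freshly spawned thread, value seen by a pre-existing pool worker }. */
    static String[] demo() {
        CONTEXT.remove();                                   // start from a clean state
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            Callable<String> readContext = CONTEXT::get;
            pool.submit(readContext).get();                 // warm-up: the single worker thread is created here

            CONTEXT.set("request-42");                      // the "request thread" sets its value afterwards

            String[] seenByNewThread = new String[1];
            Thread child = new Thread(() -> seenByNewThread[0] = CONTEXT.get());
            child.start();
            child.join();                                   // join() makes the child's write visible here

            String seenByPoolWorker = pool.submit(readContext).get();
            return new String[] { seenByNewThread[0], seenByPoolWorker };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("new thread:  " + seen[0]);
        System.out.println("pool worker: " + seen[1]);
    }
}
```

The spawned thread copies the parent's inheritable values when it is constructed, so it sees "request-42"; the pool worker was constructed earlier and sees `null`.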

Now there are ways around this (the thread locals can be explicitly passed in), but this isn't always appropriate or practical. The simplest solution is to spawn new worker threads on demand, and let InheritableThreadLocal do its job.

This brings us back to the question - if I have a high volume site, where user request threads are spawning off half a dozen worker threads each (i.e. not using a thread pool), is this going to give the JVM a problem? We're potentially talking about a couple of hundred new threads being created every second, each one lasting less than a second. Do modern JVMs optimize this well? I remember the days when object pooling was desirable in Java, because object creation was expensive. This has since become unnecessary. I'm wondering if the same applies to thread pooling.

I'd benchmark it, if I knew what to measure, but my fear is that the problems may be more subtle than can be measured with a profiler.

Note: the wisdom of using thread locals is not the issue here, so please don't suggest that I not use them.

I was going to suggest that wrapping your ThreadLocal in an accessor method would probably solve your issues with InheritableThreadLocal, but you don't seem to want to hear that. Plus, it seems that you're using InheritableThreadLocal as an out-of-band call frame, which, to be honest, seems like a code smell. — kdgregory, Jan 22 '10 at 12:27
As far as thread pools go, the main benefit is control: you know that you won't suddenly try to spin up 10,000 threads in a second. — kdgregory, Jan 22 '10 at 12:28
@kdgregory: For your first point, the ThreadLocals in question are used by Spring's bean scoping. That's the way Spring works, and not something I have control over. For your second point, the inbound request threads are limited by Tomcat's thread pool, so the limiting is inherent in that. — skaffman, Jan 22 '10 at 12:30
How does the Tomcat thread pool limit the number of threads that you create? You describe an application where "user request threads [spawn] half a dozen worker threads," and I thought your concern was about these threads. One bug and you could easily have 10,000 threads spun up for a single request. — kdgregory, Jan 22 '10 at 13:00
Regarding the reason you need ThreadLocal, however: it's valid, and a good thing to post in the message to avoid smart-ass comments :-) — kdgregory, Jan 22 '10 at 13:00
FYI, [*Project Loom*](https://wiki.openjdk.java.net/display/loom/Main) is trying to bring "virtual threads" (fibers) as another tool in the Java concurrency toolbox. Virtual threads are *very* cheap in terms of fast performance, stacks in memory that grow and shrink as needed, and automatic thread "parking" (set aside) when code blocks. I do not know if/how virtual threads work with `InheritableThreadLocal`. The Loom team is soliciting feedback if anybody would like to try their experimental builds based on early-access Java 17. — Basil Bourque, Mar 12 '21 at 08:05

Jaan · Accepted Answer · 2016-11-30T23:38:58.637

Here is an example microbenchmark:

public class ThreadSpawningPerformanceTest {
static long test(final int threadCount, final int workAmountPerThread) throws InterruptedException {
    Thread[] tt = new Thread[threadCount];
    final int[] aa = new int[tt.length];
    System.out.print("Creating "+tt.length+" Thread objects... ");
    long t0 = System.nanoTime(), t00 = t0;
    for (int i = 0; i < tt.length; i++) { 
        final int j = i;
        tt[i] = new Thread() {
            public void run() {
                int k = j;
                for (int l = 0; l < workAmountPerThread; l++) {
                    k += k*k+l;
                }
                aa[j] = k;
            }
        };
    }
    System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
    System.out.print("Starting "+tt.length+" threads with "+workAmountPerThread+" steps of work per thread... ");
    t0 = System.nanoTime();
    for (int i = 0; i < tt.length; i++) { 
        tt[i].start();
    }
    System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
    System.out.print("Joining "+tt.length+" threads... ");
    t0 = System.nanoTime();
    for (int i = 0; i < tt.length; i++) { 
        tt[i].join();
    }
    System.out.println(" Done in "+(System.nanoTime()-t0)*1E-6+" ms.");
    long totalTime = System.nanoTime()-t00;
    int checkSum = 0; //display checksum in order to give the JVM no chance to optimize out the contents of the run() method and possibly even thread creation
    for (int a : aa) {
        checkSum += a;
    }
    System.out.println("Checksum: "+checkSum);
    System.out.println("Total time: "+totalTime*1E-6+" ms");
    System.out.println();
    return totalTime;
}

public static void main(String[] kr) throws InterruptedException {
    int workAmount = 100000000;
    int[] threadCount = new int[]{1, 2, 10, 100, 1000, 10000, 100000};
    int trialCount = 2;
    long[][] time = new long[threadCount.length][trialCount];
    for (int j = 0; j < trialCount; j++) {
        for (int i = 0; i < threadCount.length; i++) {
            time[i][j] = test(threadCount[i], workAmount/threadCount[i]); 
        }
    }
    System.out.print("Number of threads ");
    for (long t : threadCount) {
        System.out.print("\t"+t);
    }
    System.out.println();
    for (int j = 0; j < trialCount; j++) {
        System.out.print((j+1)+". trial time (ms)");
        for (int i = 0; i < threadCount.length; i++) {
            System.out.print("\t"+Math.round(time[i][j]*1E-6));
        }
        System.out.println();
    }
}
}

The results on 64-bit Windows 7 with Sun's 32-bit Java 1.6.0_21 Client VM on an Intel Core2 Duo E6400 @ 2.13 GHz are as follows:

Number of threads        1     2    10   100  1000  10000  100000
1. trial time (ms)     346   181   179   191   286   1229   11308
2. trial time (ms)     346   181   187   189   281   1224   10651

Conclusions: Two threads do the work almost twice as fast as one, as expected since my computer has two cores. Creating, starting and joining 100,000 threads took about 11 seconds in total, so my computer can spawn nearly 10,000 threads per second, i.e. the thread creation overhead is about 0.1 milliseconds per thread. Hence, on such a machine, a couple of hundred new threads per second pose a negligible overhead (as can also be seen by comparing the columns for 2 and 100 threads: 181 ms versus roughly 190 ms).

score 10 · Answer 2 · answered Jan 22 '10 at 12:23

First of all, this will of course depend very much on which JVM you use. The OS will also play an important role. Assuming the Sun JVM (Hm, do we still call it that?):

One major factor is the stack memory allocated to each thread, which you can tune using the -Xss JVM parameter (for example -Xss256k); you'll want to use the lowest value you can get away with.
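
Besides the JVM-wide flag, a stack size can also be requested per thread through the four-argument `Thread` constructor. A minimal sketch follows; the class and method names are hypothetical and the 256 KB figure is an arbitrary assumption. The Javadoc describes `stackSize` as a suggestion, so a given VM may round it or ignore it entirely.

```java
// Hypothetical demo class; not part of any library.
final class StackSizeDemo {

    /** Sums 1..n in a worker thread that was requested with the given stack size. */
    static int sumInThread(long stackSizeBytes, int n) {
        int[] result = new int[1];
        Runnable work = () -> {
            int sum = 0;
            for (int i = 1; i <= n; i++) {
                sum += i;
            }
            result[0] = sum;
        };
        // Thread(ThreadGroup, Runnable, String, long): the last argument is the requested stack size.
        Thread worker = new Thread(null, work, "small-stack-worker", stackSizeBytes);
        worker.start();
        try {
            worker.join();                      // join() makes the worker's write visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // 256 KB is an assumed value; shallow, non-recursive work needs very little stack.
        System.out.println(sumInThread(256 * 1024, 100));
    }
}
```

Whether a smaller stack actually reduces memory use depends on the platform, so this is something to measure rather than assume.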

And this is just a guess, but I think "a couple of hundred new threads every second" is definitely beyond what the JVM is designed to handle comfortably. I suspect that a simple benchmark will quickly reveal quite unsubtle problems.

answered Jan 22 '10 at 12:23

Michael Borgwardt

I find the notion of what `new Thread()` means to be an interesting one. In a modern JVM, `new Object()` doesn't always allocate new memory, it reuses previously garbage-collected objects. I wonder if there's any reason why the JVM couldn't have a hidden, internal pool of reusable threads, so that `new Thread()` doesn't necessarily create a new kernel thread. You'd get effective thread-pooling, without needing an API for it. – skaffman Jan 22 '10 at 14:13
If this is so, it should be found in some JSR. Might be 133 http://www.cs.umd.edu/~pugh/java/memoryModel/jsr133.pdf – Bozho Jan 22 '10 at 14:17
@skaffman Your hypothesis seems consistent with what I've been observing on at least osx/jdk1.6. A few times over the past few months I've raced thread pool + "new runnable" against a like-sized semaphore + "new thread" and there never seems to be any measurable difference. The semaphore approach seems to sometimes edge out the pool approach, but the difference is so tiny and so rare that it really just stresses how alike they are and how hard you would have to work to get any difference between them. – David Blevins Jun 08 '11 at 07:02
What if I create a new Thread() once every 3 minutes? Would that still be expensive in terms of memory for an app that runs 24 hours a day? @Michael, what would the solution be? Every 3 minutes one thread ends and another one is created. Would that still be expensive? :D – gumuruh Dec 07 '16 at 00:23
@gumuruh: one thread created every 3 minutes is no problem whatsoever. It could only become a problem if the threads don't end and their stack memory is not reclaimed. – Michael Borgwardt Dec 07 '16 at 08:43
I have about 3-8 different objects calling new Thread(), each at least 3 minutes apart, and the threads' lifetimes also differ from one another. When I ran the app on my Win7 machine with 2 GB of RAM it was very slow, but when I reduced the number of calling objects to 2-3 it was not slow. I think this could be a thread memory problem as well. @MichaelBorgwardt – gumuruh Dec 07 '16 at 10:25

score 1 · Answer 3 · Bozho · 2010-01-22T14:13:42.517

For your benchmark you can use JMeter plus a profiler, which should give you a direct overview of the behaviour in such a heavily loaded environment. Just let it run for an hour and monitor memory, CPU, etc. If nothing breaks and the CPU(s) don't overheat, it's OK :)
Perhaps you can get a thread pool, or customize (extend) the one you are using, by adding some code so that the appropriate InheritableThreadLocals are set each time a Thread is acquired from the thread pool. Each Thread has these package-private fields:
```
/* ThreadLocal values pertaining to this thread. This map is maintained
 * by the ThreadLocal class. */
ThreadLocal.ThreadLocalMap threadLocals = null;

/*
 * InheritableThreadLocal values pertaining to this thread. This map is
 * maintained by the InheritableThreadLocal class.  
 */ 
ThreadLocal.ThreadLocalMap inheritableThreadLocals = null;
```
You can use these (well, with reflection) in combination with Thread.currentThread() to get the desired behaviour. However, this is a bit ad hoc, and furthermore, I can't tell whether the reflection will introduce even bigger overhead than just creating the threads.
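
As an alternative to reflecting on `Thread`'s private fields, the value can be copied explicitly when the task is submitted. This is a minimal sketch, assuming the application owns the `ThreadLocal` in question (here a hypothetical `CONTEXT`); the class and method names are made up, and it does not cover thread locals hidden inside a framework.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper; not part of any library.
final class ContextPropagation {
    static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Wraps a task so that it runs with the CONTEXT value of the thread that called this method. */
    static <T> Callable<T> withCallerContext(Callable<T> task) {
        final String captured = CONTEXT.get();          // evaluated in the submitting thread
        return () -> {
            String previous = CONTEXT.get();            // whatever the pool worker had before
            CONTEXT.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker's state so the value does not leak into the next task.
                if (previous == null) {
                    CONTEXT.remove();
                } else {
                    CONTEXT.set(previous);
                }
            }
        };
    }

    /** Returns { value seen by a wrapped task, value seen by a plain task submitted afterwards }. */
    static String[] demo() {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            CONTEXT.set("user-7");
            Callable<String> readContext = CONTEXT::get;
            Callable<String> wrapped = withCallerContext(readContext);
            String seenWrapped = pool.submit(wrapped).get();
            String seenPlain = pool.submit(readContext).get();
            return new String[] { seenWrapped, seenPlain };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
            CONTEXT.remove();
        }
    }

    public static void main(String[] args) {
        String[] seen = demo();
        System.out.println("wrapped task: " + seen[0]);
        System.out.println("plain task:   " + seen[1]);
    }
}
```

The wrapped task sees the submitter's value, while the plain task that runs afterwards on the same worker sees nothing, because the wrapper restores the worker's previous state in its `finally` block.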

edited Jan 22 '10 at 14:13

answered Jan 22 '10 at 12:31

Bozho

The transcription of threadlocals is something I did consider. In my particular case, however, I'm using `@Async` in Spring 3, which decouples the mechanics of the `Callable` from the business logic. It's very cool, but means you don't get access to the executor itself, or the tasks that get created. – skaffman Jan 22 '10 at 14:16
Did you check whether Spring has some pluggable mechanism for replacing the executor implementation? If not, then to go further into hacking, you could try creating a class with the same qualified name as the one where you would eventually put your custom code, and let it be loaded instead of the original one. But that's a last resort. – Bozho Jan 22 '10 at 14:27
Hmmm, yes, Spring does allow you to specify the executor used for @Async, so yes, there's a way of passing across the threadlocals there, although as you said, it's still going to get pretty ugly. – skaffman Jan 22 '10 at 14:37

score 0 · Answer 4 · answered Jan 22 '10 at 12:38

I am wondering whether it is necessary to spawn new threads on each user request if their typical life-cycle is as short as a second. Could you use some kind of notify/wait queue where you spawn a given number of (daemon) threads, and they all wait until there's a task to solve? If the task queue gets long, you spawn additional threads, but not on a 1-1 ratio. It will most likely perform better than spawning hundreds of new threads whose life-cycles are so short.
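
The scheme described here maps fairly closely onto `java.util.concurrent.ThreadPoolExecutor`. Below is a minimal sketch with arbitrary, assumed sizes (2 core threads, up to 8 threads, a queue of 16 tasks); the class and method names are hypothetical. Note that `ThreadPoolExecutor` only starts threads beyond the core size once its bounded queue is full.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of any library.
final class GrowingPoolDemo {

    /** Daemon worker threads, a bounded task queue, and extra threads only when the queue is full. */
    static ThreadPoolExecutor newGrowingPool(int core, int max, int queueCapacity) {
        ThreadFactory daemonFactory = runnable -> {
            Thread t = new Thread(runnable, "worker");
            t.setDaemon(true);
            return t;
        };
        return new ThreadPoolExecutor(
                core, max,
                30L, TimeUnit.SECONDS,                       // idle threads above 'core' die after 30 s
                new ArrayBlockingQueue<>(queueCapacity),
                daemonFactory,
                new ThreadPoolExecutor.CallerRunsPolicy());  // when saturated, the submitter runs the task
    }

    /** Computes 1^2 + ... + n^2 with one pooled task per term. */
    static int sumOfSquares(int n) {
        ThreadPoolExecutor pool = newGrowingPool(2, 8, 16);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int x = i;
                Callable<Integer> task = () -> x * x;
                futures.add(pool.submit(task));
            }
            int sum = 0;
            for (Future<Integer> f : futures) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10));
    }
}
```

`CallerRunsPolicy` is one possible choice for the saturated case; it throttles the submitting thread instead of rejecting work.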

answered Jan 22 '10 at 12:38

Terje

What you're describing is a thread pool, which I already described in the question. – skaffman Jan 22 '10 at 14:03
If each Request thread acts as a ThreadPool, I guess I just don't see why you couldn't have a `private ThreadLocal local;` which you instantiate each time the Request thread wakes up, and when processing each worker thread, you use `local.set()` / `local.get()`, but it's likely I misunderstand your problem. – Terje Jan 22 '10 at 14:32
