
I'm playing around with the JVM (Oracle 1.7, 64-bit) on a Linux box (6-core AMD, 16 GB RAM) to see how the number of threads in an application affects performance. I'm hoping to measure the point at which context switching degrades performance.

I have created a little application that creates a thread execution pool:

Executors.newFixedThreadPool(numThreads)

I adjust numThreads every time I run the program, to see the effect it has.

I then submit numThreads jobs (instances of java.util.concurrent.Callable) to the pool. Each one increments an AtomicInteger, does some work (creates an array of random integers and shuffles it), and then sleeps for a while. The idea is to simulate a web service call. Finally, the job resubmits itself to the pool, so that I always have numThreads jobs working.

I am measuring the throughput, as in the number of jobs that are processed per minute.
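In outline, each job looks something like the sketch below (class names, the array size and the exact timings are illustrative, not the actual code):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ThroughputTest {

        static final AtomicInteger completedJobs = new AtomicInteger();

        // Simulates one "web service call": a bit of CPU work plus a long wait.
        static class Job implements Callable<Void> {
            private final ExecutorService pool;
            private final Random random = new Random();

            Job(ExecutorService pool) {
                this.pool = pool;
            }

            @Override
            public Void call() throws Exception {
                // CPU-bound part: build and shuffle a list of random integers
                List<Integer> data = new ArrayList<Integer>();
                for (int i = 0; i < 1000; i++) {
                    data.add(random.nextInt());
                }
                Collections.shuffle(data);

                Thread.sleep(300);                // simulate waiting for the remote response
                completedJobs.incrementAndGet();
                pool.submit(this);                // resubmit, so numThreads jobs are always in flight
                return null;
            }
        }

        public static void main(String[] args) throws Exception {
            int numThreads = Integer.parseInt(args[0]);
            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            for (int i = 0; i < numThreads; i++) {
                pool.submit(new Job(pool));
            }
            // sample completedJobs periodically to work out jobs per minute
        }
    }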

With several thousand threads, I can process up to 400,000 jobs a minute. Above 8000 threads, the results start to vary a lot, suggesting that context switching is becoming a problem. But I can continue to increase the number of threads to 30,000 and still get higher throughput (between 420,000 and 570,000 jobs per minute).

Now the question: I get a java.lang.OutOfMemoryError: Unable to create new native thread with more than about 31,000 jobs. I have tried setting -Xmx6000M which doesn't help. I tried playing with -Xss but that doesn't help either.

I've read that ulimit can be useful, but increasing with ulimit -u 64000 didn't change anything.

For info:

[root@apollo ant]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 127557
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

So, question #1: What do I have to do to be able to create a bigger thread pool?

Question #2: At what stage should I expect to see context switching really reducing throughput and causing the process to grind to a halt?


Here are some results, after I modified the program to do a little more processing (as was suggested) and started recording average response times (as was also suggested).

// ( (n_cores x t_request) / (t_request - t_wait) ) + 1
// 300 ms wait, 10ms work, roughly 310ms per job => ideal response time, 310ms
// ideal num threads = 1860 / 10 + 1 = 187 threads
//
// results:
//
//   100 =>  19,000 thruput,  312ms response, cpu < 50%
//   150 =>  28,500 thruput,  314ms response, cpu 50%
//   180 =>  34,000 thruput,  318ms response, cpu 60%
//   190 =>  35,800 thruput,  317ms response, cpu 65%
//   200 =>  37,800 thruput,  319ms response, cpu 70%
//   230 =>  42,900 thruput,  321ms response, cpu 80%
//   270 =>  50,000 thruput,  324ms response, cpu 80%
//   350 =>  64,000 thruput,  329ms response, cpu 90%
//   400 =>  72,000 thruput,  335ms response, cpu >90%
//   500 =>  87,500 thruput,  343ms response, cpu >95%
//   700 => 100,000 thruput,  430ms response, cpu >99%
//  1000 => 100,000 thruput,  600ms response, cpu >99%
//  2000 => 105,000 thruput, 1100ms response, cpu >99%
//  5000 => 131,000 thruput, 1600ms response, cpu >99%
// 10000 => 131,000 thruput, 2700ms response, cpu >99%,  16GB Virtual size
// 20000 => 140,000 thruput, 4000ms response, cpu >99%,  27GB Virtual size
// 30000 => 133,000 thruput, 2800ms response, cpu >99%,  37GB Virtual size
// 40000 =>       - thruput,    -ms response, cpu >99%, >39GB Virtual size => java.lang.OutOfMemoryError: unable to create new native thread

I interpret them as:

1) Even though the application is sleeping for 96.7% of the time, that still leaves a lot of thread switching to be done.

2) Context switching is measurable, and it shows up in the response time.

What is interesting here is that when tuning an app, you might choose an acceptable response time, say 400ms, and increase the number of threads until you get that response time, which in this case would let the app process around 95 thousand requests a minute.

Often people say that the ideal number of threads is near the number of cores. In apps that have wait time (blocked threads, say waiting for a database or web service to respond), the calculation needs to take that into account (see my equation above). But even that theoretical ideal isn't an actual ideal when you look at the results, or when you tune for a specific response time.
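To make that sizing formula concrete, here is a throwaway helper (the class and method names are just for illustration) that reproduces the 187-thread figure from the comment header above:

    public class ThreadPoolSizing {

        // ( (n_cores x t_request) / (t_request - t_wait) ) + 1
        static int idealThreads(int nCores, double tRequestMs, double tWaitMs) {
            return (int) ((nCores * tRequestMs) / (tRequestMs - tWaitMs)) + 1;
        }

        public static void main(String[] args) {
            // 6 cores, 310ms per request, of which 300ms is spent waiting:
            // (6 x 310) / (310 - 300) + 1 = 187
            System.out.println(idealThreads(6, 310, 300));
        }
    }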

Ant Kutschera
  • Out of curiosity, have you tried seeing what the performance looks like with only as many threads as you have cores? (Minus a couple, maybe?) It depends a LOT on what the actual work being performed is, but if it's not I/O bound, you may be surprised... – T.J. Crowder Jan 27 '13 at 09:06
  • 2
    For question 2, you shouldn't max out throughput. You can stop adding more thread when the increase in number of threads results in more increase the response time than the increase in utilization/throughput. – nhahtdh Jan 27 '13 at 09:07
  • To get maximum performance, the number of threads is roughly ( (n_cores x t_request) / (t_request - t_wait) ) + 1, where n_cores is the number of cores, t_request is the time taken to process and wait for the web service response, and t_wait is the time spent waiting for the web service response. So the optimum is indeed quite close to the number of cores, but since I am simulating waiting for a web service response, it's higher than just the number of cores. The point is, though, that I want to show that lots of threads cause a problem, and I don't seem to be able to do that :-) – Ant Kutschera Jan 27 '13 at 09:12
  • 1
    If you increase the processing time and reduce the waiting time of each thread, then you'll start having many context switches. If all your threads are sleeping, there are not so many context switches. – JB Nizet Jan 27 '13 at 09:17
  • Heuristic approach? Use a thread pool class where it's easy to add/remove threads. Start with one thread, leave the app running for some interval, record/log the performance. Double the number of threads in the pool. Keep going till it crashes. Repeat until the results are statistically significant. Modify the 'increase number of threads' algorithm to get a better estimate. Maybe reduce the number of threads if you can detect that performance is degrading, in an attempt to get a more accurate value quicker. Let us know the results! – Martin James Jan 27 '13 at 10:07
  • I have found a similar limit for Java processes under Linux. I don't know the cause but I don't believe it is a lack of resources. I haven't worried about it too much because I have never needed that many threads to keep all the cores busy and adding any more would just add overhead. i.e. I suspect there is no particularly good application for so many threads. – Peter Lawrey Jan 27 '13 at 18:09

2 Answers


I get a java.lang.OutOfMemoryError: Unable to create new native thread with more than about 31,000 jobs. I have tried setting -Xmx6000M which doesn't help. I tried playing with -Xss but that doesn't help either.

The -Xmx setting won't help because thread stacks are not allocated from the heap.

What is happening is that the JVM is asking the OS for a memory segment (outside of the heap!) to hold the stack, and the OS is refusing the request. The most likely reasons for this are a ulimit or an OS memory resource issue:

  • The "data seg size" ulimit is unlimited, so that shouldn't be the problem.

  • So that leaves memory resources. 30,000 threads at 1MB each is ~30GB, which is a lot more physical memory than you have. My guess is that there is enough swap space for 30GB of virtual memory, but you have pushed the boundary just a bit too far.

The -Xss setting should help, but you need to make the requested stack size LESS than the default size of 1MB. Besides, there is a hard minimum size.

Question #1: What do I have to do to be able to create a bigger thread pool?

Decrease the default stack size to below what it currently is, or increase the amount of available virtual memory. (The latter is NOT recommended, since it looks like you are already seriously over-allocating.)
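If changing -Xss globally is not convenient, a possible alternative (a sketch only, assuming a smaller per-thread stack is acceptable; the 256k figure is arbitrary) is to give the pool a ThreadFactory that requests a smaller stack via the four-argument Thread constructor. Note that the javadoc describes the stackSize argument as a suggestion that the VM is free to ignore:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.atomic.AtomicInteger;

    public class SmallStackThreadFactory implements ThreadFactory {

        private static final long STACK_SIZE = 256 * 1024; // 256k per thread, illustrative only
        private final AtomicInteger count = new AtomicInteger();

        @Override
        public Thread newThread(Runnable r) {
            // The stack size passed here is only a hint; some JVMs/platforms ignore it.
            return new Thread(null, r, "worker-" + count.incrementAndGet(), STACK_SIZE);
        }

        public static void main(String[] args) {
            ExecutorService pool =
                    Executors.newFixedThreadPool(30000, new SmallStackThreadFactory());
            // submit jobs as before...
            pool.shutdown();
        }
    }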

Question #2: At what stage should I expect to see context switching really reducing throughput and causing the process to grind to a halt?

It is not possible to predict that. It will be highly dependent on what the threads are actually doing. And indeed, I don't think that your benchmarking is going to give you answers that will tell you how a real multi-threaded application is going to behave.


The Oracle site says this on the topic of thread stack space:

In Java SE 6, the default on Sparc is 512k in the 32-bit VM, and 1024k in the 64-bit VM. On x86 Solaris/Linux it is 320k in the 32-bit VM and 1024k in the 64-bit VM.

On Windows, the default thread stack size is read from the binary (java.exe). As of Java SE 6, this value is 320k in the 32-bit VM and 1024k in the 64-bit VM.

You can reduce your stack size by running with the -Xss option. For example:

  java -server -Xss64k

Note that on some versions of Windows, the OS may round up thread stack sizes using very coarse granularity. If the requested size is less than the default size by 1K or more, the stack size is rounded up to the default; otherwise, the stack size is rounded up to a multiple of 1 MB.

64k is the least amount of stack space allowed per thread.

Stephen C
  • If I reduce the stack size to 160k, the minimum allowed by the JVM, then I get a stack overflow on starting up the app. With 200k, I can see that the virtual memory has reduced from nearly 40 gigs down to only 13 gigs (using system monitor). But I still get the same error. Funnily enough Eclipse starts to get OOME (I am running these tests from the command line, not in Eclipse!), so that indicates that the entire system really is being pushed to its limits. Something doesn't quite add up though, with what you have answered, because reducing stack size hasn't let me create more threads. – Ant Kutschera Jan 27 '13 at 12:41
  • There may be other limits; e.g. an OS imposed limit on the number of native threads per process, or overall. – Stephen C Sep 25 '15 at 07:23
  • I have read that on some JVMs the thread stack can be on the heap, because the JVM spec does not force the separation of those two (like [this answer](https://stackoverflow.com/questions/36898701/how-does-java-jvm-allocate-stack-for-each-thread)), but other sources suggest what you said: the thread stack is not on the JVM heap, and if we increase `-Xmx` and `-Xms` we may reduce the memory left over for creating threads. So I think we should look into the spec of each JVM? – WesternGun Feb 27 '18 at 08:53
  • Different JVMs may implement thread stacks differently. What I wrote above only applies to JVMs that use the HotSpot / OpenJDK codebase. But obviously, memory that is used for one thing (e.g. objects) cannot be used for something else as well (e.g. stacks). Ditto for address space. – Stephen C Feb 27 '18 at 12:20

Here are some of the points/approaches I would have followed:

  1. Take a look at the data used across context switches. Instead of a boolean or a String, try using some big List or Map.

  2. Instead of trying to create a fixed pool right at start-up, try a cached pool (see the sketch after this list).

  3. Instead of letting threads die out after doing some small piece of work, keep them alive and have them come back to do small chunks of work again and again.

  4. Try to keep the processing time of each thread higher.
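For point 2, a minimal sketch of what swapping in a cached pool might look like (the Runnable body is a placeholder for the simulated work):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class CachedPoolExample {

        public static void main(String[] args) {
            // A cached pool creates threads on demand, reuses idle ones, and by
            // default discards threads that have been idle for 60 seconds.
            ExecutorService pool = Executors.newCachedThreadPool();
            for (int i = 0; i < 1000; i++) {
                pool.submit(new Runnable() {
                    @Override
                    public void run() {
                        // do the simulated work here
                    }
                });
            }
            pool.shutdown();
        }
    }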

Prateek
  • For point 3: the threads don't die, since they are in a pool, and so get re-used. For point 4: I am analysing a case where we are waiting for a remote process to respond, be it a web service or a database. I.e. I want blocking to be part of the process which I am analysing. – Ant Kutschera Jan 27 '13 at 09:36
  • @Ant for point 3: I meant provide the threads with work so that they keep going for longer periods. Threads should take small pauses and then come back. You should also take care about the internal data structure used for inter-thread communication, for example a blocking queue or a ring buffer. You might want to take a look at http://code.google.com/p/disruptor/ – Prateek Jan 28 '13 at 03:21