How to determine optimal number of threads for high latency network requests?

Question

I am writing a utility that must make thousands of network requests. Each request receives only a single, small packet in response (similar to ping), but may take upwards of several seconds to complete. Processing each response completes in one (simple) line of code.

The net effect of this is that the computer is not IO-bound, file-system-bound, or CPU-bound, it is only bound by the latency of the responses.

This is similar to, but not the same as There is a way to determine the ideal number of threads? and Java best way to determine the optimal number of threads [duplicate]... the primary difference is that I am only bound by latency.

I am using an ExecutorService object to run the threads and a Queue<Future<Integer>> to track threads that need to have results retrieved:

ExecutorService executorService = Executors.newFixedThreadPool(threadPoolSize);
Queue<Future<Integer>> futures = new LinkedList<Future<Integer>>();

for (int quad3 = 0 ; quad3 < 256 ; ++quad3) {
    for (int quad4 = 0 ; quad4 < 256 ; ++quad4) {
        byte[] quads = { quad1, quad2, (byte)quad3, (byte)quad4 };
        futures.add(executorService.submit(new RetrieverCallable(quads)));
    }
}

... I then dequeue all the elements in the queue and put the results in the required data structure:

int[] result = int[65536]
while(!futures.isEmpty()) {
    try {
        results[i] = futures.remove().get();
    } catch (Exception e) {
        addresses[i] = -1;
    }
}

My first question is: Is this a reasonable way to track all the threads? If thread X takes a while to complete, many other threads might finish before X does. Will the thread pool exhaust itself waiting for open slots, or will the ExecutorService object manage the pool in such a way that threads that have completed but not yet been processed be moved out of available slots so that other threads my begin?

My second question is what guidelines can I use for finding the optimal number of threads to make these calls? I don't even know order-of-magnitude guidance here. I know it works pretty well with 256 threads, but seems to take roughly the same overall time with 1024 threads. CPU utilization is hovering around 5%, so that doesn't appear to be an issue. With that large a number of threads, what are all the metrics I should be looking at to compare different numbers? Obviously overall time to process the batch, average time per thread... what else? Is memory an issue here?

score 8 · Accepted Answer · edited May 23 '17 at 10:30

8

It will shock you, but you do not need any threads for I/O (quantitatively, this means 0 threads). It is good that you have studied that multithreading does not multiply your network bandwidth. Now, it is time to know that threads do computation. They are not doing the (high-latency) communication. The communication is performed by a network adapter, which is another process, running really in parallel with with CPU. It is stupid to allocate a thread (see which resources allocated are listed by this gentlemen who claims that you need 1 thread) just to sleep until network adapter finishes its job. You need no threads for I/O = you need 0 threads.

It makes sense to allocate the threads for computation to make in parallel with I/O request(s). The amount of threads will depend on the computation-to-communication ratio and limited by the number of cores in your CPU.

Sorry, I had to say that despite you have certainly implied the commitment to blocking I/O, so many people do not understand this basic thing. Take the advise, use asynchronous I/O and you'll see that the issue does not exist.

edited May 23 '17 at 10:30

Community

1
1

answered Oct 24 '13 at 12:04

Val

1
8
40
64

1

That's correct. The only problem is that one requires to build a framework on top of NIO2. – Andrey Chaschev Oct 24 '13 at 14:57
@AndreyChaschev I see. Building a framewrok on top of Actors is another thing. It does not require to build a framework on top of the Actors. Can I tell this? – Val Oct 24 '13 at 16:05
1

Well, it's opinionated topic. And my opinion is that threads are more intuitive for this type of job. And actors are green, lightweight threads. – Andrey Chaschev Oct 24 '13 at 16:09
Actors are brand-new to me, so I'm definitely working to understand them as we speak. @AndreyChaschev, your comment is appreciated, yes, threads are intuitive for this. If actors are really conceptually lightweight threads, then they are the approach I'm looking for. Thanks, both of you! – seawolf Oct 24 '13 at 22:11
Yes, they indeed are - [Scala: Why are Actors lightweight?](http://stackoverflow.com/questions/2439252/scala-why-are-actors-lightweight). Really interesting what you will come up with in the end. – Andrey Chaschev Oct 24 '13 at 22:42
To clarify, the 1 thread I meant is the one the system gives you when it calls your main or whatever. You are right that you pretty much never want to make an explicit threading call when doing I/O, only computation. – soru Oct 25 '13 at 00:06

score 7 · Answer 2 · edited Jun 20 '20 at 09:12

As mentioned in one of the linked answers you refer to, Brian Goetz has covered this well in his article.

He seems to imply that in your situation you would be advised to gather metrics before committing to a thread count.

Tuning the pool size

Tuning the size of a thread pool is largely a matter of avoiding two mistakes: having too few threads or too many threads. ...

The optimum size of a thread pool depends on the number of processors available and the nature of the tasks on the work queue. ...

For tasks that may wait for I/O to complete -- for example, a task that reads an HTTP request from a socket -- you will want to increase the pool size beyond the number of available processors, because not all threads will be working at all times. Using profiling, you can estimate the ratio of waiting time (WT) to service time (ST) for a typical request. If we call this ratio WT/ST, for an N-processor system, you'll want to have approximately N*(1+WT/ST) threads to keep the processors fully utilized.

My emphasis.

I had missed that link, and I'm glad you pointed it out to me... lots of good stuff. I'm definitely in the camp that goes beyond the formula you provided, N*(1+WT/ST) is such a large number that I run out of memory if I use it. It seems I'm bound to about 1200-1300 threads on my computer before resource exhaustion blocks me, so I'll have to do profiling on running processes with large numbers of threads to find an optimal range. — seawolf, Oct 24 '13 at 11:37

score 3 · Answer 3 · edited May 23 '17 at 12:02

Have you considered using Actors?

Best practises.

Actors should be like nice co-workers: do their job efficiently without bothering everyone else needlessly and avoid hogging resources. Translated to programming this means to process events and generate responses (or more requests) in an event-driven manner. Actors should not block (i.e. passively wait while occupying a Thread) on some external entity—which might be a lock, a network socket, etc.—unless it is unavoidable; in the latter case see below.

Sorry, I can't elaborate, because haven't much used this.

UPDATE

Answer in Good use case for Akka might be helpful.
Scala: Why are Actors lightweight?

score 2 · Answer 4 · answered Oct 24 '13 at 10:37

2

Pretty sure in the described circumstances, the optimal number of threads is 1. In fact, that is surprisingly often the answer to any quesion of the form 'how many threads should I use'?

Each additonal thread adds extra overhead in terms of stack (and associated GC roots), context switching and locking. This may or not be measurable: the effor to meaningfully measure it in all target envoronments is non-trivial. In return, there is little scope to provide any benifit, as processing is neither cpu nor io-bound.

So less is always better, if only for reasons of risk reduction. And you cant have less than 1.

answered Oct 24 '13 at 10:37

soru

5,464
26
30

1

That's flat-out incorrect in this case. If I run with a single thread, and assuming a 3-second turnaround for each request (they average about 3, with a standard deviation of about 1 second), making 65,000 requests will take approximately 54 *hours*. Running my program with 1024 threads takes about 20 minutes. Less is absolutely *not* better in the described conditions. – seawolf Oct 24 '13 at 21:45
I'll rephrase that to say that answer is *misleading*. I have a provable case where increasing the number of threads radically reduces overall job completion time. A different approach would be using asynchronous IO, which (if I understand correctly) can have similar or greater performance impact. In terms of simplicity, however, threads are quicker to get up and running, and in many cases are appropriate for jobs of this type but smaller scale. – seawolf Oct 24 '13 at 21:53
OK, I'm humble enough that I can say I misunderstood your post and agree with what you are getting at. In the future it would help if you included an example of a replacement strategy. As pointed out in [this reply](http://stackoverflow.com/a/19565450/1886109), Actors are conceptually lightweight threads and if I can use that or some similar concurrency model then I agree, true threads are overkill. "Conceptually" is the key word here, I need some sort of concurrency, and threads are one reasonable way to achieve that, just not at the scale that I need. – seawolf Oct 24 '13 at 22:24
Sorry, I thought the idea that you would use async I/O for something like this went without saying. – soru Oct 24 '13 at 23:58
It probably should have, but I've done very little with concurrency over the years... that's always been someone else's job. So yeah, I knew concurrency was necessary, but async I/O didn't immediately spring to mind. – seawolf Oct 26 '13 at 01:25

score 2 · Answer 5 · answered Oct 24 '13 at 11:21

I assume the desired optimization is the time to process all requests. You said the number of requests is "thousands". Evidently, the fastest way is to issue all requests at once, but this may overflow the network layer. You should determine how many simultaneous connections can network layer bear, and make this number a parameter for your program.

Then, spending a thread for each request require a lot of memory. You can avoid this using non-blocking sockets. In Java, there are 2 options: NIO1 with selectors, and NIO2 with asynchronous channels. NIO1 is complex, so better find a ready-made library and reuse it. NIO2 is simple but available only since JDK1.7.

Processing the responses should be done on a thread pool. I don't think the number of threads in the thread pool greatly affects the overall performance in your case. Just make tuning for thread pool size from 1 to the number of available processors.

I've just started reading about NIO and selectors, and this looks like it's probably the right track for what I'm doing. Thanks! Any particular ready-made libraries that you've worked with and can recommend investigating? — seawolf, Oct 24 '13 at 21:47
I did not encountered a library that I would like, so I started my own at https://github.com/rfqu/pipeline/tree/master/pipeline-nio At least, it is compact and easy to understand. — Alexei Kaigorodov, Oct 25 '13 at 05:44

score 1 · Answer 6 · answered Oct 24 '13 at 11:05

In our high-performance systems, we use the actor model as described by @Andrey Chaschev.

The no. of optimal threads in your actor model differ with your CPU structure and how many processes (JVMs) do you run per box. Our finding is

If you have 1 process only, use total CPU cores - 2.
If you have multiple process, check your CPU structure. We found its good to have no. of threads = no. of cores in a single CPU - e.g. if you have a 4 CPU server each server having 4 cores, then using 4 threads per JVM gives you best performance. After that, always leave at least 1 core to your OS.

score 0 · Answer 7 · edited May 23 '17 at 11:55

0

An partial answer, but I hope it helps. Yes, memory can be an issue: Java reserves 1 MB of thread stack by default (at least on Linux amd64). So with a few GB of RAM in your box, that limits your thread count to a few thousand.

You can tune this with a flag like -XX:ThreadStackSize=64. That would give you 64 kB, which is plenty in most situations.

You could also move away from threading entirely and use epoll to respond to incoming responses. This is far more scalable but I have no practical experience with doing this in Java.

edited May 23 '17 at 11:55

Community

1
1

answered Oct 24 '13 at 09:46

Thomas

174,939
50
355
478

1

The problem is that you can only set the thread stack size globally, and for an application's main threads, 64kb can be very inadequate - just look at the stack traces of a typical Spring or EJB-based app. – Michael Borgwardt Oct 24 '13 at 10:01
Unfortunately I'm running in an environment where I'm limited to 4GB total RAM, so even if I increase thread stack size I can't increase my total thread availability by much. – seawolf Oct 24 '13 at 11:42

How to determine optimal number of threads for high latency network requests?

7 Answers7