
I am currently working on the performance of a distributed application, targeting the network component. Currently, for each connection there is a dedicated thread which handles a socket in blocking mode. My goal is to reduce the number of threads (without reducing performance) and, if possible, to improve performance.

I redesigned the network component to use async communication and am trying to use one or two threads for the entire network processing. I did a simple test where I wrote in a loop from one node and read on another; this was to test the maximum capability of a network thread. I found that my busy-loop implementation was consuming 100% CPU and was getting many more operations per second than we require, so I integrated this busy-loop implementation into the existing application.

The problem I found is that the other threads are not allowing these async network threads to acquire a full CPU, even though I have an 8-core system and we are not using more than 400% CPU. Being a C programmer, I would have solved this by binding my network thread to a core and raising its scheduling priority, so that the other threads can still run on the other cores. I am not able to do something similar in Java. There are conflicting comments on Java thread priority. Also, I do not want to reduce the priority of the other threads, as that may have its own side effects.

How would you solve this problem?

  • [See this post](http://stackoverflow.com/questions/2238272/java-thread-affinity), which describes a way to set a thread's processor affinity from Java, though it really uses JNI to do the job. – Chris O Aug 09 '12 at 17:30
  • You want to have two threads handle all of the network traffic. At most that would use two cores at 100%. You are running four cores at 100% so from the information here it seems possible that you don't have a problem. What makes you think you have a problem? – Jon Strayer Aug 09 '12 at 17:49
  • Sorry for not being explicit. The network traffic is currently handled by 8 threads and together they do not consume more than 100% CPU; 400% is the entire application's load. I am thinking of reducing the number of threads by using a single (or two) async network thread(s). – user1588261 Aug 09 '12 at 17:59

2 Answers

3

I have a library that supports thread affinity in Java on Linux and Windows: https://github.com/peter-lawrey/Java-Thread-Affinity

If you isolate the CPUs, you can ensure that the CPUs you assign will not be used for anything else (other than non-maskable interrupts). This works best on Linux, AFAIK.
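
For illustration, a minimal sketch of pinning a network thread with the library. The package and class names assume the current OpenHFT packaging (net.openhft.affinity.AffinityLock); older releases used different package names, and the BoundNetworkLoop class here is made up for the example.

import net.openhft.affinity.AffinityLock;

public class BoundNetworkLoop implements Runnable {
    @Override
    public void run() {
        // Pin this thread to a reserved CPU; pair this with isolating that CPU
        // from the general scheduler (e.g. isolcpus on Linux) so nothing else runs there.
        AffinityLock lock = AffinityLock.acquireLock();
        try {
            // ... busy-spin network loop runs here, keeping its caches warm ...
        } finally {
            lock.release(); // hand the CPU back when the loop exits
        }
    }

    public static void main(String[] args) {
        new Thread(new BoundNetworkLoop(), "nw-loop").start();
    }
}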


You can get lower latency results with busy waiting on non-blocking NIO than with blocking IO. The latter works best under load; at lower loads its latency can increase.
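
A rough sketch of what busy waiting on a non-blocking channel looks like, as opposed to parking in a blocking read. This is illustrative only; the SpinReader class is made up for the example, and connection setup and error handling are omitted.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class SpinReader {
    public static void spinRead(SocketChannel channel) throws IOException {
        channel.configureBlocking(false);              // never park the thread in the kernel
        ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
        while (!Thread.currentThread().isInterrupted()) {
            int n = channel.read(buffer);              // returns 0 immediately if nothing has arrived
            if (n < 0) break;                          // peer closed the connection
            if (n > 0) {
                buffer.flip();
                // ... hand the bytes to the application ...
                buffer.clear();
            }
            // no sleep or select() here: the thread keeps the CPU and its caches stay warm
        }
    }
}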

You might also find this library interesting: https://github.com/peter-lawrey/Java-Chronicle. It allows you to persist millions of messages per second, optionally to a second process.

BTW: Thread priority is just a hint; the OS is free to ignore it (and often does).
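
For what it's worth, a priority hint is roughly all the standard Java API lets you express; on some platforms the JVM maps several Java priority levels onto the same OS priority unless it is started with special flags. The thread here is just a placeholder.

// Illustrative only: setting a priority is just a request to the scheduler.
Thread nwThread = new Thread(() -> { /* busy-spin network loop */ }, "nw-loop");
nwThread.setPriority(Thread.MAX_PRIORITY); // a hint; the OS may coalesce or ignore it
nwThread.start();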


A simple example comparing warm vs cold code. All it does is copy an array repeatedly and time it. Once the code and data have warmed up you wouldn't expect it to slow down, but all it takes is a 10 ms delay, even on a quiet machine, to significantly increase the time it takes to do the copy.

public static void main(String... args) throws InterruptedException {
    int[] from = new int[60000], to = new int[60000];
    for (int i = 0; i < 10; i++)
        copy(from, to); // warm up
    for (int i = 0; i < 10; i++) { // code and data are still hot in cache
        long start = System.nanoTime();
        copy(from, to);
        long time = System.nanoTime() - start;
        System.out.printf("Warm copy %,d us%n", time / 1000);
    }
    for (int i = 0; i < 10; i++) {
        Thread.sleep(10); // even a 10 ms pause is enough for the caches to cool
        long start = System.nanoTime();
        copy(from, to);
        long time = System.nanoTime() - start;
        System.out.printf("Cold copy %,d us%n", time / 1000);
    }
}

private static void copy(int[] a, int[] b) {
    for (int i = 0, len = a.length; i < len; i++)
        b[i] = a[i];
}

prints

Warm copy 20 us
Warm copy 20 us
Warm copy 19 us
Warm copy 23 us
Warm copy 20 us
Warm copy 20 us
Cold copy 100 us
Cold copy 80 us
Cold copy 89 us
Cold copy 92 us
Cold copy 80 us
Cold copy 112 us
Peter Lawrey
  • can I ask just to be sure? So there is a way, with Java, to assign a CPU to do only YOUR work? – Eugene Aug 09 '12 at 18:12
  • On Linux you can tell the OS not to schedule anything on a CPU (or CPUs); then you can assign specific threads (or a process) to those CPUs. It's indirect but it works very nicely. I haven't found the same controls for Windows or Mac OS. – Peter Lawrey Aug 09 '12 at 18:30
  • This is amazing, I did not know about this and am eager to test the library you wrote. Thank you! – Eugene Aug 09 '12 at 18:32
  • BTW it supports binding/reserving a whole core if you have hyperthreading enabled but don't want one or two critical thread(s) to share the core with another thread. – Peter Lawrey Aug 09 '12 at 18:35
  • Thanks, I will try this out. Can you please elaborate on the second part? – user1588261 Aug 09 '12 at 18:43
  • The second part: "You can get lower latency results with busy waiting on non-blocking NIO than with blocking IO. The latter works best under load; at lower loads its latency can increase." – user1588261 Aug 09 '12 at 18:45
  • Whenever you perform an OS call, or whenever you give up the CPU by blocking, your caches get disturbed or even cleared; the latter takes only about 10 ms. When the caches are cleared and your code is running cold, it can take 2-4 times longer to run the same code. – Peter Lawrey Aug 09 '12 at 18:48
  • Added a simple example of what I mean. – Peter Lawrey Aug 09 '12 at 18:59
1

This really smacks of premature optimization to me. You have an 8-core system and are only using 400% CPU. What makes you think this is not a textbook example of an IO-bound program? What makes you think you have not maxed out your network IO chain?

@Peter knows his stuff and I'm sure you can hack processor affinity and force your critical threads onto a single CPU, but the question is: will it make your program run any faster? I sincerely doubt it. The modern Java VM is very smart about thread scheduling and I suggest that it is doing its job appropriately. Unless you have very good evidence to the contrary, I would let it handle the scheduling. Even priorities mean very little if most of the threads are waiting for IO.

Also, what makes you think that reducing the number of threads is somehow better? This moves a lot of code from native land (i.e. thread multiplexing) into Java land (i.e. NIO code). If you are talking about thousands of threads then I'd agree, but even hundreds of threads should be an efficient way to handle the connections.
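
As a point of comparison, a sketch of the conventional thread-per-connection model this paragraph is defending. The class name, port, and pool size are made up for the example.

import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadPerConnectionServer {
    public static void main(String[] args) throws IOException {
        ExecutorService pool = Executors.newFixedThreadPool(200); // hundreds of threads is still cheap
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket socket = server.accept();
                pool.execute(() -> handle(socket));   // one blocking handler per connection
            }
        }
    }

    private static void handle(Socket socket) {
        try (Socket s = socket) {
            byte[] buf = new byte[8192];
            // simple blocking read loop; the OS scheduler multiplexes the handler threads
            while (s.getInputStream().read(buf) != -1) {
                // ... process the bytes ...
            }
        } catch (IOException ignored) {
        }
    }
}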

I've done a ton of thread programming for more than two decades and I've never had to force thread affinity. Certainly, sizing thread pools and making good decisions about where to apply thread pools versus dedicated threads is an art, but forcing the VM to schedule the threads the way you think they should be scheduled is just not a good use of your time. Spending some time with a profiler to find out where your program is spending its time would be a better investment, IMHO.

Gray
  • In my experiment I have only one async network thread, which does a busy loop across all the endpoints. I was expecting that this thread would grab 100% CPU. This is exactly the case when I run the test in isolation, but when I turn on the other functionality, and hence more threads, the situation is a lot different. It seems the Linux scheduler does not allow this thread to be scheduled 100% of the time, even though on an 8-core system the load is never more than 400%. – user1588261 Aug 10 '12 at 05:09
  • You are doing busy loops across all of the end points? Wow. Seems like a really questionable architecture. Was a thread per end-point really untenable? How many end-points are there? If you are not using NIO to replace a per-thread end-point, then I can't believe your way is somehow better. But it's hard to know @user1588261 without knowing the details. – Gray Aug 10 '12 at 14:07
  • @Gray I am wondering if you think this is premature optimization even for applications that run on NUMA architectures, with a DB pull pipeline execution model (read_From_Cache -> decompressing -> decoding -> build_hashmap -> Join -> ....), where I want to use a thread per processing stage; wouldn't it make more sense to keep all the threads within one NUMA region at least? Thanks! – Slim Bouguerra May 20 '19 at 16:49
  • I don't know much about the particulars with NUMA @SlimBouguerra. Anything somewhat loosely coupled (i.e. not inside the same silicon) has to pay for coordination and that needs to be factored into a threading cost/benefit analysis. If you have a CPU bound process then fine but the more your job is tied to IO or the more the threads will need to cross-talk to get the job done, the less likely you will see a significant optimization bump when you thread-ify your application. Hope this helps. – Gray May 20 '19 at 21:10