5

I am searching for a way to list all synchronization calls of a running parallel Java application, in order to detect scalability problems (in terms of threads/cores). To my understanding, each time a synchronized block is entered, the machine needs to synchronize caches. This affects all running CPUs (in several ways, such as memory bandwidth), even if the running tasks are not blocked by entering a synchronized region.

The setting:

I have a large application that is parallelized on a higher level, i.e. it has complex tasks that are executed in parallel. The parallelization works in the sense that all cores are under load and I have no blocked threads. Still, the performance does not scale with the number of cores, which may have several reasons. The particular possible reason I am interested in is whether there are a lot of synchronization calls (e.g. entering a synchronized block, using a lock, etc.).

The Task

I would like to find out which places in my code (that are actually executed) have such a synchronization call and how often each synchronization is actually performed. There are many referenced libraries, therefore it is not possible to just use a regular code search on the synchronized keyword or something like that, because this would search a lot of never-executed code and bring up a lot of false positives. The perfect solution would be a profiler that lists all executed synchronization places and the number of calls. However, the profilers I tried out only allow method calls to be counted. Therefore, the problem here is to find all methods that are actually relevant.

Alternatively, it would also help if I could just find the synchronization places referenced by some entry point (main method), i.e. by recursively going through the code and checking all referenced methods, classes, etc. for such synchronization. In this case it would be possible to find out the frequency later by using a regular profiler.
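For illustration, such a scan over the bytecode could roughly look like the sketch below (using the ASM library; this is only an idea, not a working tool of mine: it lists synchronized methods and monitorenter instructions in a single class file, does not yet do the recursive "reachable from main" part, and does not cover java.util.concurrent locks):

import java.io.IOException;
import java.io.InputStream;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class SyncSiteScanner {

    /** Prints all synchronization sites found in one class file. */
    public static void scan(InputStream classFile) throws IOException {
        ClassReader reader = new ClassReader(classFile);
        String className = reader.getClassName();
        reader.accept(new ClassVisitor(Opcodes.ASM5) {
            @Override
            public MethodVisitor visitMethod(int access, String name, String desc,
                                             String signature, String[] exceptions) {
                // methods declared with the synchronized modifier
                if ((access & Opcodes.ACC_SYNCHRONIZED) != 0) {
                    System.out.println(className + "." + name + desc + " is a synchronized method");
                }
                // synchronized blocks compile to monitorenter instructions
                return new MethodVisitor(Opcodes.ASM5) {
                    @Override
                    public void visitInsn(int opcode) {
                        if (opcode == Opcodes.MONITORENTER) {
                            System.out.println(className + "." + name + desc + " contains a synchronized block");
                        }
                    }
                };
            }
        }, ClassReader.SKIP_DEBUG);
    }
}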

The Question

Are there tools or workflows that are able to achieve the task above for a larger project?

Thanks in advance for your answers.

Till Schäfer

4 Answers

4

You can use a CPU profiler to do this. If you have a synchronized method which is taking a long time to acquire a lock, it will appear to be taking a long time. If it is not taking a long time, you shouldn't care about it.

Now, if a method is taking a long time, it might not be clear whether the time goes into the synchronized lock or not. If you really can't tell from reading the code, you can move the implementation to a private method so that all the public method does is acquire the lock. This makes it much clearer whether the delay is in obtaining the lock or in running the code.
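A minimal sketch of that refactoring (class and method names are just for illustration):

public class TaskProcessor {

    // The synchronized wrapper only acquires the lock and delegates, so a CPU
    // profiler attributes time spent waiting for the lock to process() and time
    // spent doing the actual work to doProcess().
    public synchronized void process(int task) {
        doProcess(task);
    }

    private void doProcess(int task) {
        // ... the actual (possibly long-running) work, unchanged ...
    }
}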

Another reason to use a profiler: when you guess at what the problem might be, it is almost never what you first thought. Even if you have been performance tuning Java programs for ten years, what you think of first might be in the top 5 or top 10, but it is rarely the biggest problem you have.

Peter Lawrey
  • The problem here is that no blocking threads are shown and the threads always have work, but performance still does not scale. Therefore I thought that the locking may cause some additional memory traffic that also affects other memory accesses. – Till Schäfer Nov 18 '15 at 17:38
  • @TillSchäfer You are right. A common cause of poor scalability for CPU-intensive processes is L3 cache contention. The L3 is 10-20x slower than L1 and it is a shared resource. If you have 10 cores and 1% of all accesses touch the L3 cache, the average time for a memory access roughly doubles, i.e. 99% * 1 + 1% * 10 * (10 to 20) ≈ 2 to 3 L1 access times. Once your L3 cache is fully utilized, adding more threads won't help much and could make things worse. – Peter Lawrey Nov 19 '15 at 09:35
  • @TillSchäfer What we do is minimise the amount of data shared between threads and minimise garbage as much as possible, as this improves the efficiency of the L1/L2 caches and you touch the L3 less. In short, do a memory profile and reduce your garbage rate. – Peter Lawrey Nov 19 '15 at 09:35
3

Entering and leaving a synchronized block is a rather cheap operation unless there is contention on this block. In the uncontended case, synchronized is just an atomic CAS, or almost a no-op if the UseBiasedLocking optimization succeeds. Though it looks possible to build a synchronization profiler using the Instrumentation API, this won't make much sense.

The problem for a multithreaded application is contended synchronization. The JVM has some internal counters to monitor lock contention (see this question). Or you can even write a simple ad-hoc tool to track all contended locks using JVMTI events.
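Independent of those JVM-internal counters, contention can also be observed from Java itself via the standard java.lang.management API. A minimal sketch, assuming the JVM supports thread contention monitoring, meant to be run inside the application (or adapted to attach over JMX):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ContentionDump {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        if (threads.isThreadContentionMonitoringSupported()) {
            threads.setThreadContentionMonitoringEnabled(true);
        }

        Thread.sleep(10_000); // let the application do some work first

        // Print how often and how long each thread was blocked on a monitor
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info != null && info.getBlockedCount() > 0) {
                System.out.printf("%s: blocked %d times, %d ms total%n",
                        info.getThreadName(), info.getBlockedCount(), info.getBlockedTime());
            }
        }
    }
}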

However, not only locks can cause contention. Even non-blocking algorithms can suffer from competition for a shared resource. Here is a good example of such a scalability problem. So, I would agree with @PeterLawrey that it's better to start with a CPU profiler, as it is usually more handy for finding performance problems.
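As a small hand-rolled illustration of that kind of contention (a sketch, not a rigorous benchmark): many threads doing nothing but CAS updates on a single shared AtomicLong compete for the same cache line, so throughput per thread drops as threads are added, even though no thread is ever blocked in the lock sense.

import java.util.concurrent.atomic.AtomicLong;

public class SharedCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        for (int threads = 1; threads <= Runtime.getRuntime().availableProcessors(); threads *= 2) {
            AtomicLong counter = new AtomicLong();
            Thread[] workers = new Thread[threads];
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                workers[i] = new Thread(() -> {
                    for (int n = 0; n < 10_000_000; n++) {
                        counter.incrementAndGet(); // lock-free, but still a shared resource
                    }
                });
                workers[i].start();
            }
            for (Thread t : workers) {
                t.join();
            }
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + counter.get() + " increments in " + ms + " ms");
        }
    }
}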

apangin
  • With contention you mean that one thread is actually blocked by another, right? Thanks for the tips, they really help. – Till Schäfer Nov 18 '15 at 17:34
  • Do you think entering a synchronized block has an impact on memory bandwidth problems? If the memory is the bottleneck, a cache sync may add some more traffic here, or am I wrong? – Till Schäfer Nov 18 '15 at 17:40
  • @TillSchäfer No, a `synchronized` block is not *that* heavy. I doubt it can be the root cause of memory traffic problems. There is no such thing as a *"cache sync"*, but rather a [cache coherence protocol](https://en.wikipedia.org/wiki/MESI_protocol) and related [memory barriers](https://en.wikipedia.org/wiki/MESI_protocol#Memory_Barriers). They do cost something, but not too much, and they deal with particular cache lines, not with the whole memory. – apangin Nov 18 '15 at 23:17
  • I think this answer is wrong. For example, replacing BufferedOutputStream / BufferedInputStream with a non-synchronized version improved the speed of our code by a factor of 4. Of course the instances were used only from a single thread. – Horcrux7 Nov 10 '16 at 10:24
  • @Horcrux7 If you have a counterexample, please share the benchmark code, so I can reply or explain why this happens. I did a presentation that showed how different benchmarks (even when made with good tools) can lead to completely opposite results. There was a part about `synchronized` performance, too. The presentation is not in English, but here are [a few slides](https://drive.google.com/open?id=0B7LUxTf0AV-fbkJKVUszVVFKakU) to give you an idea of what could go wrong. – apangin Nov 10 '16 at 17:58
  • @apangin See the sample code in the separate answer http://stackoverflow.com/a/40546105/12631 – Horcrux7 Nov 11 '16 at 10:35
1

It's a Java application, so you can use the JDK tools in jdk1.8.XX.XX\bin. With VisualVM or jmc (Java Mission Control) you can visualize threads and locks. And you can add logging to the application (log4j or another tool) to measure execution time.
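For the logging part, a rough sketch (class and lock names are hypothetical) that times a suspected synchronized section with log4j, so the numbers can be correlated with the thread and lock views in VisualVM or Java Mission Control:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class Worker {
    private static final Logger LOG = LogManager.getLogger(Worker.class);
    private final Object sharedLock = new Object();

    void doTask() {
        long start = System.nanoTime();
        synchronized (sharedLock) {
            // ... the critical section under suspicion ...
        }
        // Log how long acquiring the lock and executing the critical section took
        LOG.debug("critical section took {} us", (System.nanoTime() - start) / 1_000);
    }
}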

Mr_Thorynque
0

The following very simple sample shows that synchronized with a single monitor can cost some time even in a single-threaded algorithm. It shows that the synchronized BufferedOutputStream in this sample is approximately 1/4 slower. It streams 100 MB to a no-op stream. More complex code can produce larger performance decrements.

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BenchmarkTest {

    public static void main( String[] args ) throws IOException {
        while( true ) { // repeat the measurements so that later iterations reflect JIT-compiled code
//            testNop();
            testSync();
            testNoSync();
        }
    }

    static void testNop() throws IOException {
        BenchmarkTest test = new BenchmarkTest();
        test.out = new OutputStream() {
            @Override
            public void write( int b ) throws IOException {
                // nop
            }
        };
        test.run( "            nop OutputStream" );
    }

    static void testSync() throws IOException {
        BenchmarkTest test = new BenchmarkTest();
        test.out = new BufferedOutputStream( new OutputStream() {
            @Override
            public void write( int b ) throws IOException {
                // nop
            }
        }, 32768 );
        test.run( "   sync BufferedOutputStream" );
    }

    static void testNoSync() throws IOException {
        BenchmarkTest test = new BenchmarkTest();
        test.out = new FastBufferedOutputStream( new OutputStream() {
            @Override
            public void write( int b ) throws IOException {
                // nop
            }
        }, 32768 );
        test.run( "no sync BufferedOutputStream" );
    }

    private OutputStream out;

    void run( String testName ) throws IOException {
        long time = System.currentTimeMillis();
        for( int i = 0; i < 100_000_000; i++ ) { // ~100 MB, written one byte at a time
            out.write( i );
        }
        System.out.println( testName + " time: " + (System.currentTimeMillis() - time) );
    }

    static public class FastBufferedOutputStream extends OutputStream {
        private byte[]       buf;

        private int          count;

        private OutputStream out;

        /**
         * Creates a BufferedOutputStream without synchronized.
         *
         * @param out the underlying output stream.
         */
        public FastBufferedOutputStream( OutputStream out ) {
            this( out, 8192 );
        }

        /**
         * Creates a BufferedOutputStream without synchronized.
         *
         * @param out the underlying output stream.
         * @param size the buffer size.
         * @exception IllegalArgumentException if size &lt;= 0.
         */
        public FastBufferedOutputStream( OutputStream out, int size ) {
            this.out = out;
            this.buf = new byte[size];
        }

        /**
         * Flush the internal buffer
         * 
         * @throws IOException if any I/O error occur
         */
        private void flushBuffer() throws IOException {
            if( count > 0 ) {
                out.write( buf, 0, count );
                count = 0;
            }
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void write( int b ) throws IOException {
            if( count >= buf.length ) {
                flushBuffer();
            }
            buf[count++] = (byte)b;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void write( byte[] b, int off, int len ) throws IOException {
            if( len >= buf.length ) {
                /* If the request length exceeds the size of the output buffer,
                   flush the output buffer and then write the data directly.
                   In this way buffered streams will cascade harmlessly. */
                flushBuffer();
                out.write( b, off, len );
                return;
            }
            if( len > buf.length - count ) {
                flushBuffer();
            }
            System.arraycopy( b, off, buf, count, len );
            count += len;
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void flush() throws IOException {
            flushBuffer();
            out.flush();
        }

        /**
         * {@inheritDoc}
         */
        @Override
        public void close() throws IOException {
            flushBuffer();
            out.close();
        }
    }
}
Horcrux7
  • Unfortunately, this example collects many common [benchmarking pitfalls](http://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html). Please take a look at [this question](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java). This code can be [modified](https://gist.github.com/apangin/b02ffe663a64d3d9ed8c2cfa680d61dc) so that both `sync` and `noSync` implementations show the same timings. – apangin Nov 11 '16 at 11:22
  • @apangin Your sample has nothing to do with real-world code. 1.) You would never add a synchronized to single-threaded code. 2.) Most of the time you have no access, at the top of your code, to all the synchronized objects deep inside your code. For example if I add a ZipOutputStream or a DataOutputStream around the BufferedOutputStream. 3.) Your changes only have an effect on a 64-bit VM. With a 32-bit VM there is no such effect; there is still a difference of 1/4, which is very mysterious. – Horcrux7 Nov 11 '16 at 13:09
  • Even without the outer `synchronized` the difference is about 1.25x, not the 4x that you claimed before. Considering that your benchmark also has nothing to do with real-world use cases (i.e. it measures just synchronization with no useful work), this still means that uncontended synchronized calls are rather cheap. – apangin Nov 11 '16 at 13:39
  • @apangin I wrote that this test shows only a 1/4 or 0.25 difference. The factor of 4 is what we see in our real application, a web server application. I think the problem is that multiple tasks run in parallel. Every task is single-threaded, but the synchronized calls of one thread also slow down the other tasks. The effect increases with the number of threads. Another negative effect of a high count of synchronized calls can be on the GC; my test ignores this. I don't like simplified benchmark tests, I test with the real application. There are too many effects. – Horcrux7 Nov 11 '16 at 13:53
  • I don't want to speculate about the performance of applications that I have never seen. There are cases where locks make the code [even faster](http://stackoverflow.com/questions/28708058/why-does-this-code-run-faster-with-a-lock)! But all these examples and benchmark numbers don't mean anything in practice without proper analysis. – apangin Nov 11 '16 at 19:19