7

I am trying to demonstrate an "anytime algorithm" - an algorithm that can be stopped at any time and returns its current result. The demo algorithm just returns some mathematical function of i, where i is increasing. It chcecks whether it is interrupted, and if so, returns the current value:

    static int algorithm(int n) {
        int bestSoFar = 0;
        for (int i=0; i<n; ++i) {
            if (Thread.interrupted())
                break;
            bestSoFar = (int)Math.pow(i, 0.3);
        }
        return bestSoFar;
    }

In the main program, I use it like this:

        Runnable task = () -> {
            Instant start = Instant.now();
            int bestSoFar = algorithm(1000000000);
            double durationInMillis = Duration.between(start, Instant.now()).toMillis();
            System.out.println("after "+durationInMillis+" ms, the result is "+bestSoFar);
        };

        Thread t = new Thread(task);
        t.start();
        Thread.sleep(1);
        t.interrupt();

        t = new Thread(task);
        t.start();
        Thread.sleep(10);
        t.interrupt();

        t = new Thread(task);
        t.start();
        Thread.sleep(100);
        t.interrupt();

        t = new Thread(task);
        t.start();
        Thread.sleep(1000);
        t.interrupt();
    }
}

When I run this program, I get the following input:

after 0.0 ms, the result is 7
after 10.0 ms, the result is 36
after 100.0 ms, the result is 85
after 21952.0 ms, the result is 501

I.e, the first three threads are really interrupted when I tell them to, but the last thread is not interrupted after 1 second - it continues working for almost 22 seconds. Why does this happen?

EDIT: I get similar results using Future.get with a timeout. In this code:

    Instant start = Instant.now();
    ExecutorService executor = Executors.newCachedThreadPool();
    Future<?> future = executor.submit(task);
    try {
        future.get(800, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        future.cancel(true);
        double durationInMillis = Duration.between(start, Instant.now()).toMillis();
        System.out.println("Timeout after "+durationInMillis+" [ms]");
    }

if the timeout is at most 800 then everything works find and it prints something like "Timeout after 806.0 [ms]". But if the timeout is 900 it prints "Timeout after 5084.0 [ms]".

EDIT 2: my computer has 4 cores. The program rusn on Open JDK 8.

Erel Segal-Halevi
  • 33,955
  • 36
  • 114
  • 183
  • Is the loop actually looping, or is it getting stuck for some reason? – Carcigenicate Nov 26 '17 at 20:03
  • @Carcigenicate how can I tell? – Erel Segal-Halevi Nov 26 '17 at 20:06
  • You could just stick a `print` in the `algorithm` loop. Something like `System.out.println(bestSoFar + " " + Thread.interrupted())`. – Carcigenicate Nov 26 '17 at 20:08
  • If that doesn't print at all, it's stuck. If it does print, then it'll at least tell you a bit about the state of the function. – Carcigenicate Nov 26 '17 at 20:09
  • Works for me (the last thread terminates with `after 1002.0 ms, the result is 142`) – teppic Nov 26 '17 at 20:14
  • 1
    Works for me as well. Maybe the program has only one CPU core to use and the main thread that does the interrupting doesn't get a chance to run for a long time? Printing a timestamp before sleep and before interrupt may give some insight. – binoternary Nov 26 '17 at 20:24
  • I _can_ reproduce this by starting tasks in a loop and sleeping for 100ms between `start` and `interrupt`. I'm calling `t.join()` after the interrupt, so threads aren't stacking. After 10 loops it hangs for a minute in the `sleep` (i.e. the main thread sleeps for waaaay longer than the requested 100ms before waking up and performing the interrupt). OSX, jdk 1.8.0_131 fwiw. – teppic Nov 26 '17 at 21:10
  • 1
    interesting, looks like it's related to safepoints and JIT compiler. I can't reproduce it with your example, but if I replace {code}bestSoFar = (int)Math.pow(i, 0.3);{code} to {code}bestSoFar = i {code} I often see that last thread is not interrupted. Maybe @apangin can help – AdamSkywalker Nov 26 '17 at 21:20
  • While I'm not sure what is causing the main thread to be left unscheduled, the problem can be fixed by adding a `Thread.yeild()` call in the task loop. – teppic Nov 26 '17 at 21:52
  • 1
    How many cores does your system have? Is your program competing for CPU resources with any other process? – Gray Nov 26 '17 at 23:38
  • It looks very much like [this bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8154302). – teppic Nov 27 '17 at 02:32
  • Agree with @teppic, issue looks similar JDK-8154302, investigating more with compiler team to answer AdamSkywalker for JIT related issue – Fairoz Nov 27 '17 at 04:34
  • @teppic Thread.yield solved the problem for me too. Thanks! – Erel Segal-Halevi Nov 27 '17 at 05:29

1 Answers1

4

I can confirm this is a HotSpot JVM bug. Here is my initial analysis of the problem.

@AdamSkywalker was absolutely right in the assumption that the issue is related to safepoint elimination optimization in HotSpot HIT compiler. Though the bug JDK-8154302 looks similar, in fact it is a different problem.

What is the safepoint problem

Safepoint is the JVM mechanism to stop application threads in order to perform an operation that requires stop-the-world pause. Safepoints in HotSpot are collaborative, i.e. application threads periodically check if they need to stop. This check typically happens at method exits and inside loops.

Of course, this check is not free. So, for performance reasons JVM tries to eliminate redundant safepoint polls. One of such optimizations is to remove safepoint polls from the counted loops - the loops of the form

    for (int i = 0; i < N; i++)

or equivalent. Here N is a loop invariant of int type.

Usually these loops are short-running, but in some cases they can take really long, e.g. when N = 2_000_000_000. Safepoint operation requires that all Java threads are stopped (excluding threads runnings native methods). That is, a single long-running counted loop may delay the whole safepoint operation, and all other threads will wait for this one to stop.

That's exactly what happens in JDK-8154302. Note that

    int l = 0;
    while (true) {
        if (++l == 0) ...
    }

is just another way to express a counted loop of 232 iterations. When Thread.sleep returns from a native function and finds that safepoint operation is requested, it stops and waits until the long-running counted loop also completes. That's where the weird delays come from.

There is a task to fix this problem - JDK-8186027. The idea is to split one long loop into two parts:

    for (int i = 0; i < N; i += step) {
        for (int j = 0; j < step; j++) {
            // loop body
        }
        safepoint_poll();
    }

It is not yet implemented, but the fix is targeted to JDK 10. Meanwhile there is a workaround: the JVM flag -XX:+UseCountedLoopSafepoints will force safepoint checks inside counted loops, too.

What is wrong with Thread.interrupted()

I'm pretty sure Thread.sleep bug will be closed as a duplicate of Loop strip mining issue. You may verify that this bug goes away with -XX:+UseCountedLoopSafepoints option.

Unfortunately, this option does not help with the original problem. I caught the moment when algorithm in the original question hangs and looked at the code being executed under gdb:

loop_begin:
  0x00002aaaabe903d0:  mov    %ecx,%r11d
  0x00002aaaabe903d3:  inc    %r11d             ; i++
  0x00002aaaabe903d6:  cmp    %ebp,%r11d        ; if (i >= n)
  0x00002aaaabe903d9:  jge    0x2aaaabe90413    ;     break;
  0x00002aaaabe903db:  mov    %ecx,%r8d
  0x00002aaaabe903de:  mov    %r11d,%ecx
  0x00002aaaabe903e1:  mov    0x1d0(%r15),%rsi  ; rsi = Thread.current();
  0x00002aaaabe903e8:  mov    0x1d0(%r15),%r10  ; r10 = Thread.current();
  0x00002aaaabe903ef:  cmp    %rsi,%r10         ; if (rsi != r10)
  0x00002aaaabe903f2:  jne    0x2aaaabe903b9    ;     goto slow_path;
  0x00002aaaabe903f4:  mov    0x128(%r15),%r10  ; r10 = current_os_thread();
  0x00002aaaabe903fb:  mov    0x14(%r10),%r11d  ; isInterrupted = r10.interrupt_flag;
  0x00002aaaabe903ff:  test   %r11d,%r11d       ; if (!isInterrupted)
  0x00002aaaabe90402:  je     0x2aaaabe903d0    ;     goto loop_begin

This is how the loop in algorithm method is compiled. There is no safepoint poll here, even when -XX:+UseCountedLoopSafepoints is set.

Looks like safepoint checks were incorrectly eliminated because of Thread.isInterrupted call which was supposed to check the safepoint itself. However, Thread.isInterrupted is HotSpot intrinsic method. It means there is no real native method call, but JIT replaces a call to Thread.isInterrupted with a sequence of machine instructions with no safepoint checks inside.

I'll report the bug to Oracle shortly. Meanwhile, the workaround is to change the type of loop counter from int to long. If you rewrite the loop as

    for (long i=0; i<n; ++i) { ...

there will be no strange delays anymore.

apangin
  • 92,924
  • 10
  • 193
  • 247
  • Great answer! Can you elaborate on why it is needed for the thread to reach a safepoint in order for interruption flag to propagate? It does not seem to be a stop-the-world operation on a first glance. – Oliver Gondža Mar 24 '19 at 13:47
  • No, safepoint is not needed for interruption flag propagation. I don't see how this conclusion could be made from my answer. – apangin Apr 02 '19 at 23:25
  • Well, the question is why the interrupt is not acted upon as expected and your answer suggests the cause is a safepoint elimination. It is the connection between the interrupts and safepoints I am missing. – Oliver Gondža Apr 04 '19 at 06:23
  • @OliverGondža The key sentence is: "When `Thread.sleep` returns from a native function and finds that safepoint operation is requested, it stops and waits until the long-running counted loop also completes." Safepoint check is basically a part of every [native method invocation](https://stackoverflow.com/a/24747484/3448419) in HotSpot. It does not matter what is the reason of a safepoint - it could be [any random reason](https://stackoverflow.com/a/29673564/3448419) not related to thread interruption, e.g. a [periodic cleanup safepoint](https://stackoverflow.com/a/55405726/3448419). – apangin Apr 04 '19 at 23:41