Java scheduler executor timing issues on virtual windows server

Question

We have a Java application that needs to run, among other environments, on a Virtual (Hyper-V) Windows 2012 R2 Server. When executed on this virtual Windows server, it seems to experience weird timing issues. We've traced the issue to erratic scheduling in a Java scheduled executor:

public static class TimeRunnable implements Runnable {

    private long lastRunAt;

    @Override
    public void run() {
        long now = System.nanoTime();
        System.out.println(TimeUnit.NANOSECONDS.toMillis(now - lastRunAt));
        lastRunAt = now;
    }

}

public static void main(String[] args) {
    ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
    exec.scheduleAtFixedRate(new TimeRunnable(), 0, 10, TimeUnit.MILLISECONDS);
}

This code, which should run the TimeRunnable every 10ms, produces results such as these on the server:

While on other machines, including heavily loaded virtual Linux boxes, as well as some windows desktops, a typical run looks like this:

We don't have a lot of experience with Windows Server and Hyper-V, so can anyone offer an explanation for this phenomena? It it a Windows Server issue? Hyper-V? A Java bug on these platforms? Is there a solution?

EDIT: A colleague has written a C# version of the same program:

private static Stopwatch stopwatch = new Stopwatch();

public static void Main()
{
    stopwatch.Start();
    Timer timer = new Timer(callback, null, TimeSpan.FromMilliseconds(10), TimeSpan.FromMilliseconds(10));
}

private static void callback(object state)
{
    stopwatch.Stop();
    TimeSpan span = stopwatch.Elapsed;
    Console.WriteLine((int)span.TotalMilliseconds);
    stopwatch.Restart();
}

Here's an updated (partial) screenshot of both applications working side by side on the virtual windows server:

EDIT: A few other variants of the Java program all produce (pretty much) the same output:

A variant in which System.nanoTime() was replaced with System.currentTimeMillis()
A variant in which System.out.println() was replaced with a periodically printed StringBuilder
A variant in which the scheduling mechanism was replaced with a single thread that times itself through Thread.sleep()
A variant in which the lastRunAt is volatile

I have no clue why this might happen, but I do have one suggestion, try to have a small c# app that does the same and see if the result is similar. Run this app on the windows server host (I hope it's also a windows 2012 R2) and see if the timing issue is there too, this is to figure out if the issue is with windows server or hyper-v. — Augusto, Feb 16 '16 at 11:36
@Augusto I've updated the question with a C# version of the program — Malt, Feb 16 '16 at 14:36
@Malt really very strange!!! Can you able to generate thread dump using `virtual vm`? May be analyzing the dump will lead in help.. — Amogh, Feb 16 '16 at 16:58
by any chance the culprit is `sysout`? can you try replacing the print statement with `appending to a StringBuilder` and after a while try printing the same, just to minimize effect from external factors.. — Jos, Feb 24 '16 at 17:50
those to programs are not equal, `nanoTime()` is not the same as the .net call, so it probably gets info from a different source. did you try `System.currentTime()`? — , Feb 25 '16 at 06:29
you might be knowing this already but, if you want to achieve more accurate scheduling, you can schedule threads manually.. schedule a thread once, once that is run, in the end compute the time till next expected running time, schedule another thread with the newly computed value and so on.. — Jos, Feb 25 '16 at 06:35
Have you tried making `lastRunAt` volatile? Technically, there is no guarantee that each run execute in the same thread so there could be a visibility issue. It may not make a difference. — assylias, Feb 25 '16 at 09:36
Try running these [two snippets](https://gist.github.com/apangin/749b5d206a7d92e5ad58). What will they print? I'm pretty sure the problem comes from inconsistent values of [QueryPerformanceCounter](https://msdn.microsoft.com/library/windows/desktop/ms644904%28v=vs.85%29.aspx) across different (virtual) CPUs, but some more tests would be needed to trace down the issue. Try also pinning a process to a single CPU, will it make difference? — apangin, Feb 25 '16 at 23:31

OldCurmudgeon · Accepted Answer · 2016-02-25T11:30:39.237

This is caused by the System.currentTimeMillis() granularity. Note the comment there:

Note that while the unit of time of the return value is a millisecond, the granularity of the value depends on the underlying operating system and may be larger.

I recorded a granularity of about 15ms on one machine a while back. This would explain all of the 0 values you are seeing but not the big values.

Running an enhanced version of your test:

static final TreeMap<Long, AtomicInteger> counts = new TreeMap<>();

public static final AtomicInteger inc(AtomicInteger i) {
    i.incrementAndGet();
    return i;
}

public static class TimeRunnable implements Runnable {

    private long lastRunAt;

    @Override
    public void run() {
        long now = System.nanoTime();
        long took = TimeUnit.NANOSECONDS.toMillis(now - lastRunAt);
        counts.compute(took, (k, v) -> (v == null) ? new AtomicInteger(1) : inc(v));
        //System.out.println(TimeUnit.NANOSECONDS.toMillis(now - lastRunAt));
        lastRunAt = now;
    }

}

public void test() throws InterruptedException {
    System.out.println("Hello");
    ScheduledExecutorService exec = Executors.newScheduledThreadPool(1);
    exec.scheduleAtFixedRate(new TimeRunnable(), 0, 10, TimeUnit.MILLISECONDS);
    // Wait a bit.
    Thread.sleep(10000);
    // Shut down.
    exec.shutdown();
    while (!exec.awaitTermination(60, TimeUnit.SECONDS)) {
        System.out.println("Waiting");
    }
    System.out.println("counts - " + counts);
}

I get the output:

counts - {0=361, 2=1, 8=2, 13=2, 14=18, 15=585, 16=25, 17=1, 18=1, 22=1, 27=1, 62=1, 9295535=1}

The huge outlier is the first hit - when lastRunAt is zero. The 0=361 was when you were called 10ms later but System.currentTimeMillis() had not kicked over one of it's ticks. Note the peak at 15=585 showing a clear peak at 15ms as I suggested.

I have no explanation for the 62=1.

According to this comment: http://stackoverflow.com/questions/351565/system-currenttimemillis-vs-system-nanotime#comment9205852_351565 15ms is Windows **feature**. — zloster, Feb 29 '16 at 19:48

Stephen C · Answer 2 · 2016-02-16T14:43:59.850

3

I also don't know why it is happening. However, it is unlikely to be Java's fault. Java uses native threads, which means that thread scheduling is handled by "the operating system".

I think the real problem here is that you have built an application based on a false premise. If you read the Java documentation (for a normal / non-real-time JVM) you won't find anything that says that Java thread scheduling is accurate. Even scheduling priorities are "best effort".

The fact that you have observed scheduling to be reasonably accurate on a heavily loaded Linux VM is interesting ... but not necessarily instructive. Scheduling accuracy will depend on the nature of the load on the system. And probably one whether there is significant "overcommit" of memory, VCPUs, and I/O bandwidth in the platform.

Is there a solution?

Maybe you could figure a way to make scheduling more "accurate" on your platform (on a good day with a following wind). However, you won't get any guarantees of accuracy unless you switch to real-time OS and a real-time Java release. You won't find any real-time Java implementations for a virtualized platform. So the real solution is to avoid relying on accurate scheduling.

edited Feb 16 '16 at 14:43

answered Feb 16 '16 at 11:58

Stephen C

698,415
94
811
1,216

We never assumed that we have any hard accuracy guarantees, so this behavior doesn't break any of our logic. However in all other environments, if we schedule a task to run with a period of 10ms, we normally see that task executed every 8-12ms with the occasional rare jump to maybe 100ms. On this server however, we're nowhere near that; something seems to be actually broken, which makes us curious. – Malt Feb 16 '16 at 14:56
1

Also, the .NET version of the test program (see the updated question) seems to be very consistent in scheduling its tasks. This indicates that something might be broken in Java on this platform. – Malt Feb 16 '16 at 15:00
IMO this is within the range of expected behavior. Therefore, I think you should deem your "curiosity" to be "satisfied", and move on. There is possibly a system administration issue; e.g. tuning that particular VM or hypervisor to improve the scheduling "accuracy", but this isn't a programming issue (IMO). – Stephen C Feb 16 '16 at 22:32

score 1 · Answer 3 · edited May 23 '17 at 10:32

I think you need to increase priority of the java application process and worker thread inside java application both. Its easy to increase priority of worker thread inside java application. But its tricky to set java application to get higher cpu than what you are getting. Probably this might help in getting higher cpu for your program

How to change the priority of a running java process?

https://blogs.msdn.microsoft.com/oldnewthing/20100610-00/?p=13753

You can also give a look to get real time cpu but please note that it can delay your other kernel activities including mouse and keyboard events

The delay is definitely due to task couldn't start at the specified time and hence the next task got triggered before the period time to adjust the fixed rate as mentioned here: Java Timer

Robert Navado · Answer 4 · 2016-06-22T09:29:26.657

Most of modern HW provides multiple timer sources. Additionally, most of operation systems provide several APIs to access these timer counters with different precision (For example System timer and RTC). Knowing Microsoft, .NET platform (as well as most of MS products) utilizes intimate knowledge of Win32 API and kernel APIs. My intuition says that Timer class in C# uses different API than Java (Hotspot VM implementation described here, though it's correct for Java 5).
There is generic problem with timer precision in virtual environments. I found very interesting test results http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4503740/ describing similar issues with different hypervisors. The funny thing that Hyper-V not mentioned there, but the problem looks like not unique for the particular setup. Microsoft has an issue regarding timers correctness provided by Hyper-V running on Windows 2008 R2. god knows what's running in cloud for different cloud providers. I personally was able to reproduce the issue on AWS cloud.
So the answer to "What's this effect is" - It's hypervisors bug combined with Java implementation "features". To be sure, you can try running this test with OpenJDK where you can see the code and play with different timer sources.
But for practical reasons I suggest to avoid running timer-sensitive Java code on Windows VM. In case this is very hard requirement, I would try use Win32 timer and call JVM code from there (using JNI) or implement any other timer source (using named pipe or any other platform-specific patch). You can try using Quartz as timer and scheduler, but it's probably suffering from the same problem as well.

Java scheduler executor timing issues on virtual windows server

4 Answers4