13

I was debugging high System CPU usage (Not user CPU usage) on of our storm supervisors (Wheezy machine). Here are the observations

Output of perf for the relevant process:

Events: 10K cpu-clock
16.40%  java  [kernel.kallsyms]   [k] system_call_after_swapgs
13.95%  java  [kernel.kallsyms]   [k] pvclock_clocksource_read
12.76%  java  [kernel.kallsyms]   [k] do_gettimeofday
12.61%  java  [vdso]              [.] 0x7ffe0fea898f
 9.02%  java  perf-17609.map      [.] 0x7fcabb8b85dc
 7.16%  java  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 4.97%  java  [kernel.kallsyms]   [k] native_read_tsc
 2.88%  java  [kernel.kallsyms]   [k] sys_gettimeofday
 2.82%  java  libjvm.so           [.] os::javaTimeMillis()
 2.39%  java  [kernel.kallsyms]   [k] arch_local_irq_restore

Caught this in strace of a thread of the relevant process

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00    0.000247           0     64038           gettimeofday
  0.00    0.000000           0         1           rt_sigreturn
  0.00    0.000000           0         1           futex
------ ----------- ----------- --------- --------- ----------------
100.00    0.000247                 64040           total

Finally figured out that the thread was running in while(true) and one of the calls inside was System.currentTimeMillis(). I disabled the same and the system CPU % went down from 50% to 3%. So clearly that was the issue. What I fail to understand is, in the presence of vDSO these kernel calls should only happen in the user's address space. But as is clear from perf report, kernel calls are indeed taking place in the kernel space. Any pointers on this? Kernel version: 3.2.0-4-amd64 Debian 3.2.86-1 x86_64 GNU/Linux
clock type: kvm

Adding code of the problematic thread.

@RequiredArgsConstructor
public class TestThread implements Runnable {
    private final Queue<String> queue;
    private final Publisher publisher;
    private final int maxBatchSize;

    private long lastPushTime;
    @Override
    public void run() {
        lastPushTime = System.currentTimeMillis();
        List<String> events = new ArrayList<>();
        while (true) {
            try {
                String message = queue.poll();
                long lastPollTime = System.currentTimeMillis();
                if (message != null) {
                    events.add(message);
                    pushEvents(events, false);
                }

                // if event threshold hasn't reached the size, but it's been there for over 10seconds, push it.
                if ((lastPollTime - lastPushTime > 10000) && (events.size() > 0)) {
                    pushEvents(events, true);
                }
            } catch (Exception e) {
                // Log and do something
            }
        }
    }

    private void pushEvents(List<String> events, boolean forcePush) {
        if (events.size() >= maxBatchSize || forcePush) {
            pushToHTTPEndPoint(events);
            events.clear();
            lastPushTime = System.currentTimeMillis();
        }
    }

    private void pushToHTTPEndPoint(List<String> events) {
        publisher.publish(events);
    }
}
Rahul Jha
  • 2,773
  • 2
  • 13
  • 14
  • 1
    What else was in this `while(true)` loop? – MrSmith42 Dec 29 '17 at 18:51
  • Where do you think the Java process gets the time from? – Thorbjørn Ravn Andersen Dec 29 '17 at 18:52
  • @MrSmith42 An HTTP End point was being called. Not every time though. There was internal batching happening in the thread with linger and longer times and currentimemillis was being used precisely for that. – Rahul Jha Dec 29 '17 at 18:55
  • @ThorbjørnRavnAndersen AFAIK, System.currentTimeMillis doesn't require switching to kernel mode. They are routed through vDSO which executes those calls in user's address space. https://stackoverflow.com/a/34597360/2377581 – Rahul Jha Dec 29 '17 at 18:57
  • 3
    I think we need to see the source of that while-loop to get an idea of what happened. – MrSmith42 Dec 29 '17 at 19:02
  • 1
    “…one of the calls inside was `System.currentTimeMillis()`. I disabled the same and the system CPU % went down from 50% to 3%. So clearly that was the issue.” That is not clear to me at all. I currently have no idea what else was in that loop, and I don’t know whether all of it executed in identical conditions. – VGR Dec 29 '17 at 19:10
  • Added the sample code. What i meant by disabling was to remove the system.currentimemillis call altogether and just doing count ++. System CPU went down but user CPU increased which was expected. Since this is on apache storm, imo the best fix would be to use the tick tuples (given by storm) to do the batching. – Rahul Jha Dec 29 '17 at 19:21
  • [MCVE](https://stackoverflow.com/help/mcve). Recreate the problem in as little code as possible. – Basil Bourque Dec 29 '17 at 20:27
  • 2
    1. Your loop is a spin-loop with no sleeping or waiting. 2. Your pushToHTTPEndPoint method’s execution speed probably depends on the level of network congestion. 3. Exceptions impair performance, but you don’t even know if an exception is being thrown. 4. Your List of events grows indefinitely. There are many factors that contribute to speed besides calls to System.currentTimeMillis. – VGR Dec 29 '17 at 21:29
  • 1
    @VGR, while I agree with your statement, I’m pretty darn sure that system.currentimemillis is the culprit. PushToHTTPEndpoint will not make a network call until the batch is closed (which could be due to size or every 10 seconds). Also there are no exceptions. Also it’s pretty clear in the perf report and strace output that most of the sys CPU is being spent on the clock. I was more curious about why weren’t the vDSO optimisations kicking in. Thanks! – Rahul Jha Dec 30 '17 at 05:46
  • 3
    @RahulJha - You are correct in your understanding/ interpretation of strace and perf outputs. gettimeofday is indeed making a trip to kernel land in your case. vDSO is not being used and your system cpu usage is mostly due to that (context switch to kernel) Are you running this on some kind of virtualized setup? vDSO can fall back to usual system call too depending upon how timekeeping was done in the kernel. You should also check clock source on your system as that has a direct impact on vDSO. – Sumit Nigam Jan 02 '18 at 07:27
  • @SumitNigam Thanks sumit. Yes these are cloudified VMs. And it’s kvmclock. Xen clock had a problem with vDSO but kvmclock never did. – Rahul Jha Jan 02 '18 at 07:36
  • @RahulJha - Can't be sure but [link](https://github.com/torvalds/linux/blob/v3.13/arch/x86/vdso/vclock_gettime.c#L277) indicates that even KVM can have system level calls. – Sumit Nigam Jan 02 '18 at 09:02
  • i agree with @Rich you should rewrite the code , to avoid while(true) . simple Fix would be as Rich mentioned Blocking Queue, would block until the queue receive any message - off course you should handle timeout – Mani Jan 02 '18 at 22:19

4 Answers4

4

There is nothing else of note inside the loop, so you are spinning on System.currentTimeMillis()

vDSO will help improve the performance of System.currentTimeMillis(), but does it really change the classification of the CPU from "System" to "User"? I don't know, sorry.

This thread is going to be consuming 100% CPU, does it make a lot of difference whether it is classified as "System" or "User"?

You should rewrite this code to use a non-spin wait, for example BlockingQueue.poll(timeout)

What is your actual question here?

What I fail to understand is, in the presence of vDSO these kernel calls should only happen in the user's address space. But as is clear from perf report, kernel calls are indeed taking place in the kernel space. Any pointers on this?

Why does it matter how the CPU time spent inside this spin-lock is classified?

According to User CPU time vs System CPU time? the "System CPU Time" is:

System CPU Time: Amount of time the processor worked on operating system's functions connected to that specific program.

By that definition, time spent spinning on the System.currentTimeMillis() would count as System time, even if it did not require a user-to-kernel mode switch due to vDSO.

Rich
  • 15,048
  • 2
  • 66
  • 119
  • Hey Rich! Appreciate your response here. My only problem was if vDSO is not doing its job and the kernel calls are executed in kernel space, CPU will have to switch its context to kernel mode which is expensive. I was curious if vDSO is kicking in at all. – Rahul Jha Jan 05 '18 at 16:11
  • Thanks, that makes sense. You could have asked about that more directly, for example "how do I tell if vDSO is being used for System.currentTimeMillis()". Good luck! – Rich Jan 09 '18 at 11:07
  • I think it is also worth noting that vDSO won't actually do you any good here. If you are spinning on `currentTimeMillis`, then it doesn't make any difference whether `currentTimeMillis` is performant or not. A spin lock which spins 10,000 times while waiting or which spins 10 times while waiting will still wait the same time and still take 100% CPU. – Rich Jan 09 '18 at 11:08
  • Agreed and apologies if the initial post was misleading. – Rahul Jha Jan 09 '18 at 11:42
4

What I fail to understand is, in the presence of vDSO these kernel calls should only happen in the user's address space. But as is clear from perf report, kernel calls are indeed taking place in the kernel space. Any pointers on this?

vDSO could be disabled on a virtual system. KVM uses PVClock (you could read more about in this nice article) and it depends on kernel version. For example, we could see here that VCLOCK_MODE is never overridden. On the other hand, here it is changed vclock_mode - and vclock_mode indicator for vDSO too.

This support was introduced in this commit and released in 3.8 version of Linux kernel.

Generally, in my practice, if you call something inside "while(true)" for a long time, you will always see a big CPU consumption.

Of course, Blocking Queue is enough in most cases, but if you need good latency and performance, you could use spinning too, without thread blocking, but you should limit spin cycles and make benchmarks to measure the impact of this optimization. The meta code could be something like:

int spin = 100;
while(spin-- > 0) {
    // try to get result
}
// still no result -> execute blocking code
egorlitvinenko
  • 2,736
  • 2
  • 16
  • 36
  • Thanks for the reply. However we do see the vDSO mapped in perf output which is attached above. – Rahul Jha Jan 08 '18 at 04:24
  • It's a good point. I have a theory to explain, but I think we need a test, then discuss. Could you update Linux kernel? – egorlitvinenko Jan 08 '18 at 07:06
  • Also, if you'd like and could use, there is util in linux distribution - https://github.com/torvalds/linux/blob/master/tools/perf/perf-read-vdso.c to collect perf map for vDSO to see what exactly called instead of hex symbols. – egorlitvinenko Jan 08 '18 at 07:08
  • Egor, I can't update the kernel. Could you please share your hypothesis? – Rahul Jha Jan 09 '18 at 08:32
  • The condition of VCLOCK_MODE is a fallback inside the vDSO library, so it is not surprising that there is vDSO in perf. – egorlitvinenko Jan 09 '18 at 09:40
  • Hmmm. Interesting thing is if i run this command strace -e trace=clock_gettime date, I get clock_gettime(CLOCK_REALTIME, {1515491572, 892371703}) = 0 Tue Jan 9 15:22:52 IST 2018. No system calls are being made indicating vDSO is doing its job. – Rahul Jha Jan 09 '18 at 09:55
  • Egor, problem was with kvmclock on my kernel version as you suggested. Much appreciate your help on this. – Rahul Jha Jan 09 '18 at 12:07
3

So I figured out the issue here. To give more context, the question was more about the fact that vDSO making system calls(Apologies if the original post was misleading!). The clock source for this kernel version (kvmclock) didn't have support for virtual system calls and hence real system calls were happening. It was introduced in this commit https://github.com/torvalds/linux/commit/3dc4f7cfb7441e5e0fed3a02fc81cdaabd28300a#diff-5a34e1e52f50e00cef4b0d0ff3fef8f7 (Thanks to egorlitvinenko for pointing this out.

Also, I do understand that anything in while(true) will consume CPU. Since this was in apache storm context where the call was to essentially batch events before making HTTP call, this could've been done in a better way by using tick tuples support of apache storm.

Rahul Jha
  • 2,773
  • 2
  • 13
  • 14
2

by reading your code, there is no control code to block the while loop, except publisher.publish(events)and queue.poll(),that means this thread is busy in while loop,never take a break.

in my opinion,you need to limit the calls on System.currentTimeMillis().a good choice is make queue.poll() blocking,some pseudocode :

while (!stopWork) {
    try {
        // wait for messages with 10 seconds timeout,if no message or timeout return empty list
        // this is easy to impl with BlockingQueue
        List<String> events = queue.poll(10,TimeUnit.SECOND);
        if (events.isEmpty()) {
            continue;
        }
        new java.util.Timer().schedule( 
            new java.util.TimerTask() {
                @Override
                public void run() {
                    pushEvents(events, true);
                }
            }, 1000*10 );
    } catch (Exception e) {
        // Log and do something
    }
}
chenjun
  • 301
  • 3
  • 10