
I have a small program running on Linux (on an embedded PC: a dual-core Intel Atom at 1.6 GHz with Debian 6 running Linux 2.6.32-5) which communicates with external hardware via an FTDI USB-to-serial converter (using the ftdi_sio kernel module and a /dev/ttyUSB* device). Essentially, in my main loop I run the following (a rough sketch in code is included after the list):

  • clock_gettime() using CLOCK_MONOTONIC
  • select() with a timeout of 8 ms
  • clock_gettime() as before
  • Output the time difference of the two clock_gettime() calls
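
For concreteness, this is roughly what the measurement loop looks like (a minimal sketch, not my actual code; the device path and open flags are placeholders and error handling is trimmed):

```c
/* Rough sketch of the measurement loop described above (not the actual
 * program): time each select() call on the serial fd with CLOCK_MONOTONIC.
 * The device path and the 8 ms timeout are illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <time.h>

static double diff_ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    int fd = open("/dev/ttyUSB0", O_RDONLY | O_NOCTTY); /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    for (;;) {
        fd_set rfds;
        struct timespec t0, t1;
        struct timeval tv = { .tv_sec = 0, .tv_usec = 8000 }; /* 8 ms timeout */

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("select() returned %d after %.3f ms\n", ready, diff_ms(t0, t1));
        /* ... read and pass on data when ready > 0 ... */
    }
}
```

(On this Debian 6 system the program has to be linked with `-lrt` for `clock_gettime`.)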

To have some level of "soft" real-time guarantees, this thread runs as SCHED_FIFO with maximum priority (showing up as "RT" in top). It is the only thread in the system running at this priority, no other process has such priorities. My process has one other SCHED_FIFO thread with a lower priority, while everything else is at SCHED_OTHER. The two "real-time" threads are not CPU bound and do very little apart from waiting for I/O and passing on data.
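
For reference, the thread is promoted to real-time priority roughly like this (a sketch under the standard pthread API; it needs root or CAP_SYS_NICE):

```c
/* Sketch: put the calling thread into SCHED_FIFO at maximum priority.
 * Requires root privileges or CAP_SYS_NICE. */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int make_realtime(void)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO); /* 99 on Linux */

    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0) {
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return -1;
    }
    return 0;
}
```

With that in place the thread shows up with priority "RT" in top, as mentioned above.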

The kernel I am using does not have the PREEMPT_RT patch applied (I might switch to it in the future). I know that if I want "proper" realtime behaviour, I need to switch to PREEMPT_RT or, better, Xenomai or the like. But nevertheless I would like to know what is behind the following timing anomalies on a "vanilla" kernel:

  • Roughly 0.03% of all select() calls are timed at over 10 ms (remember, the timeout was 8 ms).
  • The three worst cases (out of over 12 million calls) were 31.7 ms, 46.8 ms and 64.4 ms.
  • All of the above happened within 20 seconds of each other, and I think some cron job may have been interfering (although the system logs are low on information apart from the fact that cron.daily was being executed at the time).

So, my question is: What factors can be involved in such extreme cases? Is this just something that can happen inside the Linux kernel itself, i.e. would I have to switch to PREEMPT_RT, or even a non-USB interface and Xenomai, to get more reliable guarantees? Could /proc/sys/kernel/sched_rt_runtime_us be biting me? Are there any other factors I may have missed?

Another way to put this question is, what else can I do to reduce these latency anomalies without switching to a "harder" realtime environment?

Update: I have observed a new "worse worst case" of about 118.4 ms (once over a total of around 25 million select() calls). Even though I am not using a kernel with any sort of realtime extension, I find it somewhat worrying that a deadline can apparently be missed by over a tenth of a second.

mindriot
  • I'm not a Linux expert, but maybe a non-preemptible driver interrupt has a corner case that takes too much time? That would be a driver bug. Also, your delay values are strangely close to multiples of 16 ms; I bet you have many calls that last 16 ms. Maybe it's a symptom of chained retries of a failed communication, or at least something failing and being retried. – ElderBug May 20 '15 at 08:42
  • The "strange closeness" to 16ms multiples is just a random effect of this example, I have observed other times as well. – mindriot May 20 '15 at 08:52
  • Is this a tickless kernel, or what is your HZ? I remember a while ago `select()` granularity was only 10ms with `HZ=100`. Try `epoll` instead. – Maxim Egorushkin May 22 '15 at 13:02
  • On the system in question, `CONFIG_HZ=250`. But even if granularity was 10 ms, I would expect at most 20 ms instead of 8 - not 118. – mindriot May 22 '15 at 16:59
  • From what I've read about `epoll`, it only provides an advantage if you're polling a large number of file descriptors (e.g. http://stackoverflow.com/questions/8251717/epoll-vs-select-for-very-small-number-of-connections). I have only one. Could you point me to other reasons that speak for `epoll`, or ideally a reason why `epoll` might do better with regards to my observed worst-case behavior? That would be highly appreciated. – mindriot May 22 '15 at 17:03

1 Answer


Without more information it is difficult to point to something specific, so I am just guessing here:

  1. Interrupts, and the kernel code they trigger (hard IRQ handlers, softirqs, etc.), take so much time in the kernel that your real-time thread is significantly delayed. This depends on the interrupt frequency, which interrupt handlers are involved, and so on.
  2. Unless the kernel is fully preemptible, a thread with lower priority that is executing inside the kernel (e.g. in a system call) is not preempted until it yields the CPU or returns to user space, so even your SCHED_FIFO thread has to wait for it.
  3. As pointed out in this SO answer, System Management Interrupts (SMIs) and thermal management can also cause significant delays (the poster observed delays of up to 300 ms).

118 ms seems quite a lot for a 1.6 GHz CPU, but one driver that accidentally locks up the CPU for some time would be enough. If you can, try disabling some drivers or using different driver/hardware combinations.

sched_rt_runtime_us and sched_rt_period_us should not be a problem if they are set to reasonable values and your code behaves as you expect. Still, I would remove the limit for RT threads and see what happens.
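
For example, you can lift the throttling limit entirely by writing `-1` to the procfs knob (the same effect as `sysctl -w kernel.sched_rt_runtime_us=-1`). A sketch of doing this programmatically at startup, with the obvious caveat that a runaway RT thread can then monopolize a CPU:

```c
/* Sketch: disable RT throttling by writing -1 to
 * /proc/sys/kernel/sched_rt_runtime_us (must be run as root).
 * Warning: with the limit removed, a runaway SCHED_FIFO thread
 * can monopolize a CPU. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int disable_rt_throttling(void)
{
    int fd = open("/proc/sys/kernel/sched_rt_runtime_us", O_WRONLY);
    if (fd < 0) { perror("open"); return -1; }
    if (write(fd, "-1", 2) != 2) { perror("write"); close(fd); return -1; }
    return close(fd);
}
```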

What else can you do? Write a device driver! It's not that difficult, and interrupt handlers get a higher priority than real-time threads. It may be easier to switch to a real-time kernel, but YMMV.
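
To give a sense of the scale involved, the boilerplate of a kernel module is only a few lines (the module name here is hypothetical); the actual work would be registering an interrupt handler, e.g. via request_irq(), or hooking into the existing ftdi_sio driver:

```c
/* Minimal kernel module skeleton (hypothetical name "latdrv").
 * A real driver would additionally register an interrupt handler via
 * request_irq() so that time-critical work runs in hard-IRQ context,
 * above any SCHED_FIFO user-space thread. */
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

static int __init latdrv_init(void)
{
    pr_info("latdrv: loaded\n");
    return 0;
}

static void __exit latdrv_exit(void)
{
    pr_info("latdrv: unloaded\n");
}

module_init(latdrv_init);
module_exit(latdrv_exit);
MODULE_LICENSE("GPL");
```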

Mackie Messer
  • These are definitely some good recommendations, I will see what I can do about disabling some drivers. Removing the RT thread limit _may_ have helped (I don't have enough data yet to be sure though). The device driver option hadn't occurred to me. I'll check if that may be an option in my case. – mindriot May 26 '15 at 06:37
  • I know I haven't provided a lot of specific information, but I am mostly interested in general guesses so I know where to dig and actually gather more information, or when to stop digging before it starts taking too much of my time. Disabling some drivers might be worth it, but considering I have to wait around 24 hours to see if something improved, it's complicated. Could kernel profiling help? – mindriot May 26 '15 at 06:42
  • @mindriot I would try to reproduce the worst case, e.g. generate CPU load, disk and/or network traffic, etc. Not sure if kernel profiling would help, because you are looking for a pathological case, not the average case. – Mackie Messer May 26 '15 at 07:23
  • Thanks very much for your input. I'll leave this open for a few more days to see if any other ideas of what I might have overlooked come up. If not I'll accept your answer as a "nope, there's nothing else you might have missed, hunting down buggy drivers or going to a more realtime kernel is all you can do". :) – mindriot May 28 '15 at 06:39
  • @mindriot Yes, I am curious about other people's theories too. If you are able to resolve this, please come back and answer your own question. Thanks! – Mackie Messer May 28 '15 at 09:42
  • After running my software and observing it for several more days, it seems that more sensible `sched_rt_period_us`/`sched_rt_runtime_us` values indeed made a difference. At least I haven't seen such large latency outliers as before, over a longer observation period than before. I am now running with a period/runtime of 1000 ms/995 ms rather than the default 1000 ms/950 ms. The worst-case `select` latency was 34.5 ms, the second-worst 19.4 ms. There may be other potential latency killers, but it _seems_ like I will be able to live with them (and I don't have the time to dig any deeper, unfortunately). – mindriot Jun 01 '15 at 08:39
  • Since you pointed out everything that might have an influence on my latency problem, your post deserves to be accepted as the answer. – mindriot Jun 01 '15 at 08:40
  • @mindriot I saw your edits were rejected, so I turned the answer into a wiki. – Mackie Messer Jul 03 '15 at 15:12
  • Oh, thanks. I was just going to add it as a separate answer. I thought it would be easier to find for others if it appeared as part of the same answer, so your modification is appreciated. I'll put in my addition right away. – mindriot Jul 13 '15 at 07:26
  • Ah, of course my SO account is still too young :) I don't have 100 rep yet. We'll see if it gets accepted this time. – mindriot Jul 13 '15 at 07:30
  • @mindriot I accepted your edit. (this time I was quick enough.) – Mackie Messer Jul 13 '15 at 07:58