9

I have an embedded Linux system running on an Atmel AT91SAM9260EK board on which I have two processes running at real-time priority. A manager process periodically "pings" a worker process using POSIX message queues to check the health of the worker process. Usually the round-trip ping takes about 1ms, but very occasionally it takes much longer - about 800ms. There are no other processes that run at a higher priority.

It appears the stall may be related to logging (syslog). If I stop logging the problem seems to go away. However it makes no difference if the log file is on JFFS2 or NFS. No other processes are writing to the "disk" - just syslog.

What tools are available to me to help me track down why these stalls are occurring? I am aware of latencytop and will be using that. Are there some other tools that may be more useful?

Some details:

  • Kernel version: 2.6.32.8
  • libc (syslog functions): uClibc 0.9.30.1
  • syslog: busybox 1.15.2
  • No swap space configured [added in edit]
  • root filesystem is on tmpfs (loaded from initramfs) [added in edit]
camh
  • 40,988
  • 13
  • 62
  • 70
  • LTTng is an option now, as explained here: http://lttng.org/blog/2015/02/04/web-request-latency-root-cause/ – camh Feb 05 '15 at 09:56

1 Answers1

2

The problem is (as you said) syslogd. While your process is running at a RT priority, syslogd is not. Additionally, syslogd does not lock its heap and can (and will) be paged out by the kernel, especially with very few 'customers'.

What you could try is:

  • Start another thread to manage a priority queue, have that thread talk to syslog. Logging would then just be acquiring a lock and inserting something into a list. Given only two subscribers, you should not spend a lot of time acquiring the mutex.

  • Not using syslog, implement your own logging (basically the first suggestion, minus talking to syslog).

I had a similar problem and my first attempt to fix it was to modify syslogd itself to lock its heap. That was a disaster. I then tried rsyslogd, which improved some but I still got random latency peaks. I ended up just implementing my own logging using a priority queue to help ensure that more critical messages were actually written first.

Note, if you are not using swap (at all), the shortest path to fixing this is probably implementing your own logging.

Tim Post
  • 33,371
  • 15
  • 110
  • 174
  • Not using swap. Furthermore, the rootfs is a tmpfs (initramfs) so the program text (busybox) would be paged from ram. – camh Apr 26 '10 at 08:59
  • @camh - I hate not making more suggestions to solve your problem at hand, but in the time it takes to investigate this, you could have implemented your own logging. – Tim Post Apr 26 '10 at 16:30
  • I dont think the problem is with syslog itself, but somewhere in the kernel. Busybox syslog is pretty simple, and reimplementing that wont vary too much, since I need the ability to collect logs from multiple applications into a single rotating log (based on size). Besides, I have to know why this is stalling. I can't have it come back in the field. Thanks for your answer. – camh Apr 26 '10 at 22:14
  • Though the problem in this case may be syslogd, the question is a bit more genereal - how to figure out where the stall is, and **why**. If he disables syslog, but later on needs to run something else (e web server, snmp daemon, a cron job, what not) - what could be done to figure out what's really causing the stalls ? As it's not swapping (he doesn't have swap configured) it must be something else - and normally context switching isn't in the order of 800ms on a high priority process so something must be going on. Not running syslogd/crond/snmpd/etc. might not be an option. – nos Dec 16 '13 at 14:35
  • @nos I think I will come back and maintain this answer a bit, there are also more modern alternatives that can work basically out of the box for this, depending on the need. Set a reminder to visit this over the weekend, when I spend my recreational time here. – Tim Post Dec 16 '13 at 14:41