System is an embedded Linux/Busybox core on a small embedded board with a web server (Boa) running.
We are seeing some high latency in responses from the web server - sometimes >500ms for no good reason, so I've been digging...
On liberally scattering debug prints throughout the code it seems to come down to the entire process just... stopping for a bit, in a way which I can only assume must be the process/thread being interrupted by another process.
Using print statements and clock_gettime() to calculate time taken to process a request, I can see the code reach the bottom of a while() loop (parsing input), print something like "Time so far: 5ms"
and then the next line at the top of the loop will print "Time so far: 350ms"
- and all that the code does between the bottom of the loop and the 1st print back at the top is a basic check along the lines of while(position < end)
, it has nothing complicated that could hold it up.
There's no IO blocking, the data it's parsing has all arrived already, and it's not making any external calls or wandering off into complex functions.
I then looked into whether the kernel scheduler (CFS in our case) might be holding things up, adding calls to clock() (processor time rather than wall-clock) and again calculating time differences Vs processor time used I can see that the wall-clock time delay may run beyond 300ms from one loop to the next, but the reported processor time taken (which seems to have a ~10ms resolution) is more like 50ms.
So, that suggests the task scheduler is holding the process up for hundreds of milliseconds at a time. I've checked the scheduler granularity and max delay and it's nowhere near 100ms, scheduler latency is set at 6ms for example.
Any advice on what I can do now to try and track down the problem - identifying processes which could hog the CPU for >100ms, measuring/tracking what the scheduler is doing, etc.?