Finding performance issue that may be due to thread locking (possibly)

Question

I've spent a little time running valgrind/callgrind to profile a server that does a lot of TCP/IP communications using many threads. After some time improving the performance, I realised that in this particular test scenario, the process is not CPU bound so the performance "improvements" I'd looked at were of no use.

In theory, the CPU should be very busy. I know the TCP/IP device it connects to isn't the limitation, as the server runs on two machines: one is a PC, the other is an embedded device with an ARM processor. Even the embedded device only gets to about 2% CPU usage, although it does far fewer transactions (about a tenth). Both systems only get up to about 2% even though we're trying to get data as fast as possible.

My guess is that some mutex is locked and is holding up a thread. This is a pure guess! There are a few threads in the system with common data. Perhaps there are other possibilities but how do I tell?

Is there any way to use a tool like valgrind/callgrind that might show the time spent in system calls? I can also run it on Windows with Visual Studio 2012 if that's better.

We might have to try walking through the code or something but not sure that we have time.

Any tips appreciated.

Thanks.

Hmmm - wonder if this will help - --collect-systime= [default: no] This specifies whether information for system call times should be collected — Peter S, Jun 03 '14 at 18:07
If you have VS2012 at your disposal, perhaps [the built-in profiler](http://msdn.microsoft.com/en-us/library/ms182372(v=vs.110).aspx) may be of some value. — WhozCraig, Jun 03 '14 at 18:11
Just realised - I have VS2012 but not the premium or ultimate version so I don't think I have that option. — Peter S, Jun 03 '14 at 18:55
Run the program within `gdb`, interrupt the program a few times and take a look at the stack traces of all threads. I expect that either all threads are waiting on blocking network operations, or some threads are waiting to lock a mutex that is held by a thread while it is doing a blocking network operation. — nosid, Jun 03 '14 at 19:24

score 6 · Accepted Answer · edited May 23 '17 at 12:01

Callgrind is a great profiler but it does have some drawbacks. In particular, it assumes that the same instruction always executes in the same amount of time, and it assumes that instruction counts are the most important metric.

This is fine for getting (mostly) reproducible profiling results and for analyzing in detail what instructions are executed, but there are some types of performance problems which Callgrind doesn't detect:

- time spent waiting for locks
- time spent sleeping (e.g. simple sleep()/usleep() calls will effectively slow down your application but won't show up in Callgrind)
- time spent waiting for disk I/O or network I/O
- time spent waiting for data that was swapped out
- influences from CPU cache hits/misses (you can try to use Cachegrind for this particular topic)
- influences from CPU pipeline stalls, branch prediction failures and all the other features of modern CPUs that can cause the same instruction to be executed faster or slower depending on the context
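To make the first two items concrete, here is a minimal sketch (in Python for brevity, not taken from the asker's server) that compares wall-clock time with CPU time while several threads queue up behind one lock. The names `measure`, `worker_count` and `hold_seconds` are made up for illustration, and `time.sleep()` merely stands in for a blocking network call made while the lock is held.

```python
import threading
import time


def measure(worker_count=4, hold_seconds=0.05):
    """Return (wall_seconds, cpu_seconds) for workers contending on one lock."""
    lock = threading.Lock()

    def worker():
        # Each worker holds the shared lock while it "blocks" (simulated by
        # sleep), so the other workers wait behind it without using CPU.
        with lock:
            time.sleep(hold_seconds)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by the whole process
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


if __name__ == "__main__":
    wall, cpu = measure()
    print(f"wall: {wall:.3f}s  cpu: {cpu:.3f}s  cpu share: {100 * cpu / wall:.1f}%")
```

Because the workers are serialized by the lock, the wall time grows with the number of workers while the CPU time stays close to zero. A tool that only counts executed instructions sees almost nothing here, which matches the "busy but only 2% CPU" symptom in the question.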

These problems can be detected quite well using a statistical (or sample-based) profiler. Examples would be Sysprof and OProfile, or any kind of "poor-man's sampling profiler" as described e.g. at https://stackoverflow.com/a/378024. The VS2012 built-in profiler mentioned by WhozCraig appears to be a sampling profiler as well.
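The idea behind a sampling profiler can be sketched in a few lines. The following is a Python analogue of interrupting the program a few times and looking at every thread's stack, not a tool for the native server itself. It relies on `sys._current_frames()`, which is a CPython-specific function. The names `sample_stacks`, `demo`, `hold_lock_during_io` and `wait_for_lock` are hypothetical, and the sleep again stands in for a blocking network call.

```python
import collections
import sys
import threading
import time
import traceback


def sample_stacks(samples=20, interval=0.01):
    """Count the innermost Python function of every other thread per sample."""
    counts = collections.Counter()
    me = threading.get_ident()
    for _ in range(samples):
        for thread_id, frame in sys._current_frames().items():
            if thread_id == me:
                continue  # skip the sampling thread itself
            innermost = traceback.extract_stack(frame)[-1]
            counts[innermost.name] += 1
        time.sleep(interval)
    return counts


def demo():
    lock = threading.Lock()
    has_lock = threading.Event()

    def hold_lock_during_io():
        with lock:
            has_lock.set()
            time.sleep(0.5)  # stands in for a blocking network call

    def wait_for_lock():
        with lock:  # blocks until the holder releases the lock
            pass

    holder = threading.Thread(target=hold_lock_during_io)
    waiter = threading.Thread(target=wait_for_lock)
    holder.start()
    has_lock.wait()  # make sure the holder owns the lock first
    waiter.start()
    counts = sample_stacks()
    holder.join()
    waiter.join()
    return counts


if __name__ == "__main__":
    for name, hits in demo().most_common():
        print(f"{hits:3d}  {name}")
```

Functions where threads spend their time waiting accumulate the most samples, regardless of how few instructions they execute. With a native program the same information comes from the stack traces of all threads in a debugger or from a profiler such as the ones named above.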

While statistical profilers are really useful because they provide "real-world" results instead of simple instruction counts, they have the possible drawback that you don't get reproducible results easily (the results will vary a little bit with every run), and that you need to gather a sufficient number of samples to get detailed results.
