Even if I don't have a ready-to-use answer, I'll switch from the comments to an answer here, as there's more space to write and format.
Could you clarify the "lahks" term? I found only something loosely related on Wikipedia, but it's pure guesswork and I have no idea what you mean.
You say:

> A large number of objects per thread

While you were sampling/stopping randomly, did you watch the stack traces? I understand that the alloc/dealloc was the most frequently seen *leaf* of the stack trace, but what about the *non-leaf* frames? Were you able to see what was actually calling that alloc/dealloc? That is the point of the sampling method: to see the origin of the call, and to statistically estimate which of the possible origins is responsible for calling it too often.
You might not have been able to observe the 'higher' parts of the stack traces due to heavy optimization or due to an architectural mismatch (i.e. if your application uses task queuing, then most of the time you will only see "fetch task", "check task", "execute task" steps instead of the true origins), but in almost every architecture you can adjust for that adequately (in the task-queuing case, just try sampling the task registration instead!).
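For instance, if the queue only ever sees type-erased callables, you can tag each task with its registration site, so that whatever you catch inside the worker loop can be traced back to the code that scheduled it. A minimal sketch of that idea; everything here (the `TaggedTask` struct, the `ENQUEUE` macro, the single-threaded queue) is made up purely for illustration:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <string>

// Hypothetical wrapper: remember where a task was registered, so a sample taken
// inside the worker loop can be mapped back to its true origin.
struct TaggedTask {
    std::string origin;            // e.g. "parser.cpp:131", captured at enqueue time
    std::function<void()> work;
};

std::queue<TaggedTask> g_tasks;    // kept single-threaded here for brevity

// Bake the call site into the task at registration time.
#define ENQUEUE(task_body) \
    g_tasks.push(TaggedTask{ std::string(__FILE__) + ":" + std::to_string(__LINE__), \
                             (task_body) })

int main() {
    ENQUEUE([] { std::cout << "imagine heavy alloc/dealloc here\n"; });

    while (!g_tasks.empty()) {
        TaggedTask t = std::move(g_tasks.front());
        g_tasks.pop();
        // A sampler only ever shows this loop; the tag points at the real origin.
        std::cout << "running task registered at " << t.origin << "\n";
        t.work();
    }
}
```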
Yet another angle: alloc/dealloc bloat is quite universal; it is usually related to architecture and algorithms, or, well, bugs. However, this kind of thing should be observable not only in an 'optimized release' build (where seeing the stack traces is a problem), but it should also show up quickly in a 'full debug info' build: with fewer optimizations the whole system will run slower, but you should be able to see and collect all the intermediate methods that are the possible origins.
Another thing: you've said that "multi threaded" works far slower than "single threaded". That raises the question of how you switch between them. Do you have two separate implementations? Or do you just adjust the thread pool size between 1 worker thread and N worker threads? Crossing that with the alloc/dealloc problem: maybe each of your threads is required to perform too many setups/teardowns each time?
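To make that comparison meaningful, the two modes should ideally differ only in the worker count, so that any extra per-thread setup/teardown stands out on its own. A minimal pool sketch using the standard `<thread>`/`<mutex>`/`<condition_variable>` facilities; the `Pool` class and its interface are made up for illustration:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread pool: the "single threaded" and "multi threaded" runs should
// differ only in worker_count, nothing else.
class Pool {
public:
    explicit Pool(unsigned worker_count) {
        for (unsigned i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~Pool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // per-job allocations show up here in both configurations
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

Comparing `Pool(1)` against `Pool(N)` on the same stream of jobs is then an apples-to-apples test, and expensive thread construction/teardown becomes visible as well.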
Try inspecting what the threads (as a group; look at the threads' lifetimes too) have to prepare repeatedly, in contrast to the single-threaded option.
For example, it may be that the single-threaded version saves on alloc/dealloc somehow (and maybe reuses some structures), while the N-threaded version requires N times the same structures. If the threads are just repeatedly started/stopped and not reused, then probably their N sets of data are not reused either, and so the N threads may just be burning time on preparations before the actual work; see the sketch below.
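To illustrate the difference between re-creating structures per task and reusing them per thread (the scratch buffer and sizes here are just assumptions, not your actual data):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-task work on a scratch buffer.
void process(std::vector<double>& scratch, std::size_t n) {
    scratch.clear();            // keeps the already-allocated capacity
    scratch.resize(n, 0.0);
    // ... real work on scratch ...
}

// Wasteful pattern: every iteration allocates a fresh buffer (inside resize)
// and frees it again when the vector goes out of scope.
void worker_allocating(std::size_t tasks, std::size_t n) {
    for (std::size_t i = 0; i < tasks; ++i) {
        std::vector<double> scratch;
        process(scratch, n);
    }
}

// Cheaper pattern: each thread keeps one scratch buffer for its whole lifetime,
// so the allocation happens (at most) once and is reused for every task.
void worker_reusing(std::size_t tasks, std::size_t n) {
    std::vector<double> scratch;
    for (std::size_t i = 0; i < tasks; ++i)
        process(scratch, n);
}
```

If the single-threaded path effectively behaves like the second worker while the N-threaded path behaves like the first (times N), that alone can explain an allocation-dominated profile.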
Also, if you managed to catch the extraneous allocation scheme, why not trace a little further: after stopping, step out of the allocator and try to see what is being allocated. I mean, you can step on and check what is being written to that memory, and that could give you a further idea of what is actually happening. However, that may be a very laborious task, especially because it would have to be repeated many times, so I'd leave it as a last resort.
Another thing, and this is a pure guess: your platform may have a global lock inside alloc/dealloc to "safely track" the memory management. That way, even though all threads manage their own memory as they wish, the threads wait for each other at every memory alloc/dealloc operation. Changing the memory allocation scheme, using a different memory manager, using the stack or TLS, or splitting the thread pool into separate processes may help, as it escapes the need for the global lock. But that's just a very remote guess, and none of these solutions are easy to apply; a rough sketch of the TLS idea follows below.
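If that does turn out to be the bottleneck, the cheapest experiments are usually swapping in an allocator built for multi-threaded use (tcmalloc or jemalloc are common choices) or giving each thread its own scratch arena in TLS. A very rough arena sketch; the `Arena` class, its size, and the reset policy are all made-up assumptions, and alignment/overflow handling are ignored for brevity:

```cpp
#include <cstddef>
#include <vector>

// Each thread bump-allocates from its own thread_local block, so the hot path
// never touches the global heap or its lock. Purely illustrative.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), used_(0) {}

    void* allocate(std::size_t bytes) {
        if (used_ + bytes > buffer_.size())
            return nullptr;        // a real arena would fall back to the heap here
        void* p = buffer_.data() + used_;
        used_ += bytes;            // alignment is ignored for brevity
        return p;
    }

    void reset() { used_ = 0; }    // e.g. once per task, instead of many frees

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// One arena per thread, constructed lazily the first time a thread touches it.
thread_local Arena tls_arena(1 << 20);   // 1 MiB, an arbitrary size

void* fast_alloc(std::size_t bytes) {
    return tls_arena.allocate(bytes);
}
```

The point is only that the per-thread hot path stops serializing on a single global lock.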
I'm sorry for such general and vague talk; it's hard to say anything more with only the few details you've provided. I purposely avoid the "tool to visualize the jobs" topic. If you are unable to see what's happening just with the sample/stop method, then all the possible 'thread visualization' tools will most probably not be helpful: they will probably show you exactly the same thing you have already seen, because they all analyze the same stack traces, just a bit faster than stopping manually.