Even if I don't have a ready-to-use answer, I'll switch from the comments to an answer here, as there's more space to write and format.
Could you clarify the "lahks" term? I found only something loosely related on Wikipedia, but it's pure guesswork and I have no idea what you mean.
You say:

> A large number of objects per thread

While you were sampling/stopping randomly, did you watch the stack traces? I understand that the alloc/dealloc was the most frequently seen *leaf* of the stack trace, but what about the *non-leaf* frames? Were you able to see what was actually calling that alloc/dealloc? That is the point of the sampling method: to see the origin of the call, and to statistically estimate which of the possible origins is responsible for calling it too often.
You might not have been able to observe the 'higher' parts of the stack traces due to heavy optimization or due to an architectural mismatch (i.e. if your application uses task queuing, then most of the time you will only see "fetch task", "check task", "execute task" steps instead of the true origins), but in almost every architecture you can adjust for that adequately (in the task-queuing case, just try sampling the task registration instead!).
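For instance, if the queue only ever sees type-erased callables, you can tag each task with its registration site, so that whatever you catch inside the worker loop can be traced back to the code that scheduled it. A minimal sketch of that idea; everything here (the `TaggedTask` struct, the `ENQUEUE` macro, the single-threaded queue) is made up purely for illustration:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <string>

// Hypothetical wrapper: remember where a task was registered, so a sample taken
// inside the worker loop can be mapped back to its true origin.
struct TaggedTask {
    std::string origin;            // e.g. "parser.cpp:131", captured at enqueue time
    std::function<void()> work;
};

std::queue<TaggedTask> g_tasks;    // kept single-threaded here for brevity

// Bake the call site into the task at registration time.
#define ENQUEUE(task_body) \
    g_tasks.push(TaggedTask{ std::string(__FILE__) + ":" + std::to_string(__LINE__), \
                             (task_body) })

int main() {
    ENQUEUE([] { std::cout << "imagine heavy alloc/dealloc here\n"; });

    while (!g_tasks.empty()) {
        TaggedTask t = std::move(g_tasks.front());
        g_tasks.pop();
        // A sampler only ever shows this loop; the tag points at the real origin.
        std::cout << "running task registered at " << t.origin << "\n";
        t.work();
    }
}
```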
Yet another angle: alloc/dealloc bloat is quite universal; it is usually related to architecture and algorithms, or, well, bugs. However, this kind of thing should be observable not only in an 'optimized release' build (where seeing the stack traces is a problem), but it should also show up quickly in a 'full debug info' build: with fewer optimizations the whole system will run slower, but you should be able to see and collect all the intermediate methods that are the possible origins.
Another thing: you've said that "multi threaded" works far slower than "single threaded". That raises the question of how you switch between them. Do you have two separate implementations? Or do you just adjust the thread pool size between 1 worker thread and N worker threads? Crossing that with the alloc/dealloc problem: maybe each of your threads is required to perform too many setups/teardowns each time?
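To make that comparison meaningful, the two modes should ideally differ only in the worker count, so that any extra per-thread setup/teardown stands out on its own. A minimal pool sketch using the standard `<thread>`/`<mutex>`/`<condition_variable>` facilities; the `Pool` class and its interface are made up for illustration:

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread pool: the "single threaded" and "multi threaded" runs should
// differ only in worker_count, nothing else.
class Pool {
public:
    explicit Pool(unsigned worker_count) {
        for (unsigned i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~Pool() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                if (done_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();   // per-job allocations show up here in both configurations
        }
    }
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```

Comparing `Pool(1)` against `Pool(N)` on the same stream of jobs is then an apples-to-apples test, and expensive thread construction/teardown becomes visible as well.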
Try inspecting what the threads (as a group; look at the threads' lifetimes too) have to prepare repeatedly, in contrast to the single-threaded option.
For example, it may be that the single-threaded version saves on alloc/dealloc somehow (and maybe reuses some structures), while the N-threaded version requires N times the same structures. If the threads are just repeatedly started/stopped and not reused, then probably their N sets of data are not reused either, and so the N threads may just be burning time on preparations before the actual work; see the sketch below.
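To illustrate the difference between re-creating structures per task and reusing them per thread (the scratch buffer and sizes here are just assumptions, not your actual data):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-task work on a scratch buffer.
void process(std::vector<double>& scratch, std::size_t n) {
    scratch.clear();            // keeps the already-allocated capacity
    scratch.resize(n, 0.0);
    // ... real work on scratch ...
}

// Wasteful pattern: every iteration allocates a fresh buffer (inside resize)
// and frees it again when the vector goes out of scope.
void worker_allocating(std::size_t tasks, std::size_t n) {
    for (std::size_t i = 0; i < tasks; ++i) {
        std::vector<double> scratch;
        process(scratch, n);
    }
}

// Cheaper pattern: each thread keeps one scratch buffer for its whole lifetime,
// so the allocation happens (at most) once and is reused for every task.
void worker_reusing(std::size_t tasks, std::size_t n) {
    std::vector<double> scratch;
    for (std::size_t i = 0; i < tasks; ++i)
        process(scratch, n);
}
```

If the single-threaded path effectively behaves like the second worker while the N-threaded path behaves like the first (times N), that alone can explain an allocation-dominated profile.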
Also, if you managed to catch the extraneous allocation scheme, why not trace a little further: after stopping, step out of the allocator and try to see what is being allocated. I mean, you can step on and check what is being written to that memory, and that could give you a further idea of what is actually happening. However, that may be a very laborious task, especially because it would have to be repeated many times, so I'd leave it as a last resort.
Another thing, and this is a pure guess: your platform may have a global lock inside alloc/dealloc to "safely track" the memory management. That way, even though all threads manage their own memory as they wish, the threads wait for each other at every memory alloc/dealloc operation. Changing the memory allocation scheme, using a different memory manager, using the stack or TLS, or splitting the thread pool into separate processes may help, as it escapes the need for the global lock. But that's just a very remote guess, and none of these solutions are easy to apply; a rough sketch of the TLS idea follows below.
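If that does turn out to be the bottleneck, the cheapest experiments are usually swapping in an allocator built for multi-threaded use (tcmalloc or jemalloc are common choices) or giving each thread its own scratch arena in TLS. A very rough arena sketch; the `Arena` class, its size, and the reset policy are all made-up assumptions, and alignment/overflow handling are ignored for brevity:

```cpp
#include <cstddef>
#include <vector>

// Each thread bump-allocates from its own thread_local block, so the hot path
// never touches the global heap or its lock. Purely illustrative.
class Arena {
public:
    explicit Arena(std::size_t bytes) : buffer_(bytes), used_(0) {}

    void* allocate(std::size_t bytes) {
        if (used_ + bytes > buffer_.size())
            return nullptr;        // a real arena would fall back to the heap here
        void* p = buffer_.data() + used_;
        used_ += bytes;            // alignment is ignored for brevity
        return p;
    }

    void reset() { used_ = 0; }    // e.g. once per task, instead of many frees

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// One arena per thread, constructed lazily the first time a thread touches it.
thread_local Arena tls_arena(1 << 20);   // 1 MiB, an arbitrary size

void* fast_alloc(std::size_t bytes) {
    return tls_arena.allocate(bytes);
}
```

The point is only that the per-thread hot path stops serializing on a single global lock.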
I'm sorry for such general and vague talk; it's hard to say anything more with only the few details you've provided. I purposely avoid the "tool to visualize the jobs" topic. If you are unable to see what's happening just with the sample/stop method, then all the possible 'thread visualization' tools will most probably not be helpful: they will probably show you exactly the same thing you have already seen, because they all analyze the same stack traces, just a bit faster than stopping manually.