I have a multithreaded application in which thread utilization is very poor (in the ballpark of 1%-4% per thread, with fewer threads than processors). In the debugger, it appears to be spending a lot of time in vector::push_back, specifically in the placement new that occurs during the push_back. I've tried calling reserve up front so that the vector never has to expand its capacity and copy everything, but that doesn't appear to be the problem. Commenting out the vector::push_back calls leads to much better thread utilization.
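To make the reserve step concrete, here is a minimal sketch of it. This is not the actual application code: the function name `fill_without_reallocation` and the element count are hypothetical. It shows that once capacity is reserved, push_back should not move the buffer as long as the size stays within that capacity, which is why reallocation seems unlikely to be the cause.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the real application): reserves space for n
// elements, pushes n values, and reports whether the buffer stayed in place.
bool fill_without_reallocation(std::size_t n) {
    std::vector<std::uint64_t> values;
    values.reserve(n);  // one allocation up front

    const std::uint64_t* buffer_before = values.data();
    const std::size_t capacity_before = values.capacity();

    for (std::size_t i = 0; i < n; ++i) {
        values.push_back(i);  // should only construct in place, no growth
    }

    assert(values.size() == n);

    // True when neither the buffer address nor the capacity changed.
    return values.data() == buffer_before &&
           values.capacity() == capacity_before;
}
```

Calling `fill_without_reallocation(1000)` should return true, since the standard guarantees no reallocation after reserve until the size would exceed the reserved capacity.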
This problem occurs with vectors of uint64_t, so it does not appear to be the result of complicated object construction. I have tried both the standard allocator and a custom allocator, and both perform the same way. Each vector is used by the same thread that allocated it.
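A simplified sketch of the threading pattern described above, in case it helps with reproducing the behavior. Everything here is an assumption made for illustration: the names `worker` and `run_workers`, the thread count, and the element count are hypothetical, and the real application does more work per element. Each thread allocates, reserves, and fills its own vector of uint64_t with the standard allocator, and no vector is shared between threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical worker: owns its vector from allocation to destruction.
// The only thing shared with the caller is the output slot, written once.
void worker(std::size_t count, std::size_t* out_size) {
    assert(out_size != nullptr);

    std::vector<std::uint64_t> local;  // allocated by this thread
    local.reserve(count);              // avoid growth during the loop
    for (std::size_t i = 0; i < count; ++i) {
        local.push_back(i);            // the call that shows up in the debugger
    }
    *out_size = local.size();
}

// Hypothetical driver: starts num_threads workers, waits for all of them,
// and returns the total number of elements pushed across all threads.
std::size_t run_workers(unsigned num_threads, std::size_t count) {
    std::vector<std::size_t> sizes(num_threads, 0);
    std::vector<std::thread> threads;
    threads.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back(worker, count, &sizes[t]);
    }
    for (std::thread& th : threads) {
        th.join();
    }

    std::size_t total = 0;
    for (std::size_t s : sizes) {
        total += s;
    }
    return total;
}
```

Timing `run_workers` with different thread counts (kept below the number of processors) is one way to check whether a reduced case like this shows the same poor utilization as the full application.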