So I have this C++ program that parses giant dataset files and loads their contents into a hashmap in memory (this part is throttled on the main thread, so it never takes up a giant chunk of time at once). Once that is done, I flip a pointer to the new memory location and call delete on the old one. Apart from that, the program matches incoming requests by looking things up in those in-memory maps (on the main thread). Suppose those giant maps are wrapped in an Evaluator class:
Evaluator* oldEvaluator = mEvaluator;
Evaluator* newEvaluator = parseDataSet();
mEvaluator = newEvaluator;
delete oldEvaluator;
// And then on request processing:
mEvaluator->lookup(request)
The map can contain millions of string objects as keys. They're regular strings holding request attributes like IP, User-Agent, etc., and each one is a separate string object inserted into an STL unordered_map.
The dataset is periodically updated, but most of the time the program is just matching request attributes against the dataset in memory, which is fine and efficient with no errors, except when bulk consumption of a new dataset happens. The alternative way to consume the large dataset would be streaming, but that's a relatively longer-term solution.
It used to be a single-threaded program using an event-driven model, but every time a complete new set was loaded and the old one destroyed, deleting the whole thing took so long that it blocked request processing.
So I moved the deletion of the old map onto a separate thread. The problem is that while the deletion and the request processing now appear to happen simultaneously, I see a very visible, drastic slowdown on the request-processing thread.
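For reference, the offloaded version looks roughly like this (a minimal sketch: Evaluator here is just a stand-in for the real class, and I'm assuming only the main thread ever swaps the pointer):

```cpp
#include <thread>

struct Evaluator { /* the giant unordered_map lives in here */ };

Evaluator* mEvaluator = nullptr;

void swapDataSet(Evaluator* newEvaluator) {
    Evaluator* oldEvaluator = mEvaluator;
    mEvaluator = newEvaluator;  // requests on the main thread now see the new data
    // Hand the old dataset to a throwaway thread; the main loop no longer
    // blocks on the long destruction, but the frees still hit the same heap.
    std::thread([oldEvaluator] { delete oldEvaluator; }).detach();
}
```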
Of course there are other processes running on the host, and I do expect the two threads to compete for CPU cycles. But I didn't expect such a drastic slowdown on the request-matching thread. On average a request should be processed at the 500us level, but while the deletion thread was running it got as slow as 5ms. Sometimes, when the scheduler interrupts the matching thread (because it took way too long), it goes as high as 50ms or 120ms. In extreme cases a request takes the entire 1000ms to process, which is about how long the whole data-structure deletion takes on the other thread.
What is the best way to find the root cause of such a slowdown? Is it more of a CPU or a memory-bandwidth bottleneck? I imagined that as long as I put the deletion on a separate thread, I wouldn't care how slowly it went, since it has to delete the string objects one by one after all, so I didn't expect it to affect the other thread...
EDIT: Thanks, a couple of comments/answers already seem to point to several possible causes:
- Memory fragmentation. Less frequently visited strings end up in more expensive memory locations (so cache misses), or the unordered_map scatters everything behind many pointers, or the system is compacting memory while deletion punches holes all over the place? But why exactly would that slow down another thread?
- One comment mentioned heap contention due to thread-safe locking: does the entire heap of the program lock up because one thread is busy freeing, blocking the other thread's access to heap memory? Just to clarify, the program deliberately never allocates some objects while freeing others at the same time, and it only has two threads, one dedicated purely to deletion.
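If heap contention is the suspect, one way to check it in isolation, before reaching for profilers, is a micro-benchmark that times an allocation-heavy loop with and without a concurrent bulk free running. All the sizes and names below are made up for illustration:

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

// Time a loop of small heap allocations (a stand-in for building lookup
// keys) while, optionally, another thread frees a large container (a
// stand-in for the old dataset). Compare the two timings.
long long allocLoopMicros(bool deleteInBackground) {
    auto* big = new std::vector<std::string>(1'000'000, std::string(48, 'x'));

    std::thread deleter;
    if (deleteInBackground)
        deleter = std::thread([big] { delete big; });

    auto t0 = std::chrono::steady_clock::now();
    volatile char sink = 0;
    for (int i = 0; i < 100000; ++i) {
        std::string key(48, 'k');  // 48 chars defeats SSO, so this allocates
        sink = key[0];
    }
    auto t1 = std::chrono::steady_clock::now();

    if (deleteInBackground) deleter.join(); else delete big;
    (void)sink;
    return std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
}
```

If the loop is markedly slower with deleteInBackground set, the allocator is the shared resource; running perf stat (e.g. cache-misses, LLC-load-misses) on the real program can then help tell lock contention apart from a memory-bandwidth problem.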
So what should I do then? I tried jemalloc, though I'm not sure I'm using it correctly: it seems that just adding -ljemalloc to the linker line magically replaces libc's malloc? I tried it and saw no performance difference, but I could be using it wrong. My program doesn't do any explicit malloc; everything is new with sizes unknown in advance, hooked together with pointers and STL maps.
Also, all the strings stored as keys are there specifically for quick lookup, so they can't be moved into a vector with indices; that would give contiguous memory, but locating them would be horrible. So:
- How can I confirm that the above two memory issues are the cause (any tools/metrics?)
- What can I do to fix it without changing my consumption model to streaming? Assuming the root causes are the two above, it seems I should do either or both of: 1) allocate all my STL maps, along with their objects, from one pool (how do I do that?); 2) reduce heap contention (I don't know whether jemalloc solves either of these in my case).
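For what it's worth, on 1) one way I can see to get the map and all its key strings out of a single pool, without changing the lookup model, is C++17's polymorphic allocators: back everything with a std::pmr::monotonic_buffer_resource, so tearing down a dataset releases a handful of large blocks instead of millions of small ones. A minimal sketch (the int payload and the method names are placeholders, not from the real program):

```cpp
#include <memory_resource>
#include <string>
#include <string_view>
#include <unordered_map>

// Sketch: every map node and key string is carved out of one arena, so
// dropping the whole dataset means destroying the Evaluator, which
// releases a few large blocks instead of millions of small frees.
struct Evaluator {
    std::pmr::monotonic_buffer_resource pool;              // the arena
    std::pmr::unordered_map<std::pmr::string, int> map{&pool};

    void insert(std::string_view key, int value) {
        map.emplace(std::pmr::string(key, &pool), value);  // key lives in the arena
    }

    const int* lookup(std::string_view key) const {
        auto it = map.find(std::pmr::string(key));         // temporary key, default heap
        return it == map.end() ? nullptr : &it->second;
    }
};
```

Destroying the Evaluator still walks the nodes, but deallocating into a monotonic resource is a no-op, so the deletion thread would stop hammering the shared heap. Caveats: the temporary lookup key above still comes from the default heap, and monotonic_buffer_resource is not thread-safe, so only one thread should build a given Evaluator.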