
So I have this C++ program that parses giant dataset files and loads their contents into a hash map in memory (this part is throttled in the main thread, so it never takes up a big chunk of time). When that is done, I flip the pointer to the new memory location and call delete on the old one. Other than that, the program does incoming-request matching by looking up content in those in-memory maps (on the main thread). Suppose those giant maps are wrapped in an Evaluator class:

Evaluator* oldEvaluator = mEvaluator;
Evaluator* newEvaluator = parseDataSet();
mEvaluator = newEvaluator;
delete oldEvaluator;

// And then on request processing:
mEvaluator->lookup(request);

The map can contain millions of string objects as keys. They're regular strings that could be request attributes like IP, User-Agent, etc., but each is a separate string object inserted into an STL unordered_map.

The dataset is periodically updated, but most of the time the program is just doing request-attribute matching against the dataset in memory, and that is fine and efficient with no errors, except when bulk consumption of the new dataset happens. The alternative way of consuming this large dataset is streaming, but that's a relatively longer-term solution.

It used to be a single-threaded program using an event-driven model, but every time a complete new set was placed and destruction was called, it took too long to delete the whole thing, hence blocking the request processing.

So I put the deletion of such maps onto a separate thread. The problem is that while the deletion and request processing now appear to happen simultaneously, I can see a very visible, drastic slowdown on the request-processing thread.
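Roughly, the offload looks like this (a sketch using libuv's `uv_queue_work`, which is what I use, as mentioned in the comments below; the DeleteJob/swapAndRetire names are just illustrative):

#include <uv.h>

struct DeleteJob {
    uv_work_t req;
    Evaluator* doomed;
};

// Runs on a libuv worker thread, off the event loop.
static void doDelete(uv_work_t* req) {
    delete static_cast<DeleteJob*>(req->data)->doomed;
}

// Runs back on the loop thread once the deletion has finished.
static void afterDelete(uv_work_t* req, int /*status*/) {
    delete static_cast<DeleteJob*>(req->data);
}

void swapAndRetire(uv_loop_t* loop, Evaluator*& mEvaluator, Evaluator* newEvaluator) {
    auto* job = new DeleteJob;
    job->doomed = mEvaluator;
    job->req.data = job;
    mEvaluator = newEvaluator;  // requests match against the new data from here on
    uv_queue_work(loop, &job->req, doDelete, afterDelete);
}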

Of course there are other processes running on the host, and I do expect the two threads to compete for CPU cycles. But I didn't expect to see a drastic slowdown on the request-matching thread. On average, a request should be processed at the 500 µs level, but while the deletion thread was running it got as slow as 5 ms. Sometimes, when the CPU interrupts the matching thread (because it took way too long), it can go as long as 50 ms, or 120 ms, etc. In extreme cases a request can take the entire 1000 ms to be processed, which is about the time the whole data-structure deletion takes on the other thread.

What is the best way to find the root cause of such a slowdown? Is it more of a CPU or a memory-bandwidth bottleneck? I was imagining that as long as I put it on a separate thread I wouldn't care how slow it goes, because it has to delete the string objects one by one after all, so I didn't expect it to affect the other thread...

EDIT: Thanks to a couple of comments/answers that already seem to point out several possible causes:

  1. Memory fragmentation. Because less frequently visited strings are stored in more expensive memory locations (so cache misses), or because they're stored in an unordered_map with many pointers, or because the system is doing memory compacting while the deletion punches holes all over the place? But why exactly does this affect slowness in another thread?
  2. One comment mentioned heap contention due to thread-safe locking? So the entire heap of this program locks down because one thread is busy deleting, which prevents the other thread's access to heap memory? Just to clarify, the program deliberately never allocates stuff and frees other stuff at the same time, and it only has 2 threads, one dedicated just to deletion.

So what should I do then? I tried jemalloc, though I'm not sure I'm using it entirely correctly --- it seems including -ljemalloc in the linker line just magically replaces libc's malloc? I tried it, with no performance difference, but I could be using it wrong. My program doesn't do any explicit malloc; everything is new with sizes unknown in advance, hooked together with pointers and STL maps.

Also, all the strings stored as keys are specifically used for quick lookup, so they can't be stored in a vector with an index, even though that would give contiguous memory; it would be horrible to locate them. So,

  1. How can I confirm that the above 2 memory issues are the cause (any tools/metrics)?
  2. What can I do to fix it without changing my consumption model to streaming? Assuming the root causes were the above 2, it seems I should do either/both of 2 things: 1) allocate all my STL maps along with their objects from one pool? How do I do that? 2) reduce heap contention (I don't know if jemalloc solves either of these in my case).
Superziyi
  • If you have a hash map with millions of strings then surely your memory might be terribly fragmented. Consider storing the strings cumulatively in some container. And make the hash map use `std::string_view` rather than `std::string`. Another option is to use std::pmr. – ALX23z Feb 27 '20 at 09:29
  • Can you store your strings in a vector, and store the index of each string in the map? It adds one indirection, but you can process all the strings (delete them) without cache misses. – Martin Morterol Feb 27 '20 at 09:32
  • @ALX23z Any recommendations on readings about why deleting fragmented memory is a problem? Is it just because it suddenly becomes CPU-intensive or something (like it tries to reorder memory while deleting those fragments)? I didn't know anything about std::string_view or std::pmr; thanks for pointing them out, I will take a look! – Superziyi Feb 27 '20 at 09:37
  • @MartinMorterol Oh yeah, I can just use a vector instead of hand-rolling a memory block with indexes (I just checked: vector does guarantee contiguous memory). I should definitely look into that if this proves to be the problem. I guess I'm lacking some knowledge here, so I'm asking for readings/references/measurement tools to understand why and how deleting fragmented memory pieces slows things down. (I just naively thought it could delete slowly on a separate thread and I wouldn't need to care.) – Superziyi Feb 27 '20 at 09:42
  • @Superziyi Since my comment seems to be useful, I have made a formal answer for future users. – Martin Morterol Feb 27 '20 at 09:50
  • @MartinMorterol Thank you very much! I'll have a good read, try to understand the related post you shared, and give your answer feedback! – Superziyi Feb 27 '20 at 09:52
  • Deletion may not be the problem per se; of course managing hundreds of megabytes cannot be an instantaneous task, but competing with another thread that also uses allocation will possibly thrash the virtual-memory tables... I'm not sure, but it seems that you need to rethink the way you do things (maybe manage the memory yourself with your own allocators, etc.). Allocating/deallocating very intensively is bad behaviour. – Jean-Baptiste Yunès Feb 27 '20 at 09:55
  • @Jean-BaptisteYunès Yeah, allocation and deallocation don't happen as often as request matching, but still, every time they happen, requests suffer. Allocation (building the map in memory) was better, because that operation can be done incrementally (I just put a throttle on it and let it take turns with request processing), but deletion can't be done this way. And to clarify, deletion never happens at the same time as the in-memory map building. – Superziyi Feb 27 '20 at 09:59
  • What does your data look like? How large are the keys and the values? How do the data sets differ? Perhaps there is a better way to store it than a key-value map. – VLL Feb 27 '20 at 10:35
  • @vll The actual data structure is more complex; I'm largely simplifying it by saying it's a key-value map with the keys being a million strings, which is the core of 90% of what's stored in memory. The strings are just normal strings of length less than a hundred characters; they are used to match request attributes. For example, a string can be an IP address, a User-Agent, etc. – Superziyi Feb 27 '20 at 21:45
  • Keep in mind that the C++ run-time's heap is a shared data structure, and therefore accesses to the heap (i.e. memory allocations and memory frees) are likely being serialized with a mutex (or similar) in most cases, to avoid corrupting the heap's metadata during multithreaded operation. To avoid that bottleneck, you might investigate allocating your ginormous data structure on its own private heap, so that the rest of your program can continue running unmolested when you free all that data. (You might even be able to make the teardown an O(1) operation that simply resets its heap to 'blank'.) – Jeremy Friesner Mar 02 '20 at 04:48
  • @JeremyFriesner Oh, so it's contention on the entire heap? Some searching suggests that comes into play when frequent allocations and frees happen at the same time. In my case it's just a free on one thread and access to other memory on the heap from another thread; does that also matter? I'm trying jemalloc but it doesn't seem to have an effect. Do I need to write my own pool? I looked into Boost pool, but it doesn't seem to fit my variable-size string storage use case. Also, what would be a good way to "reset the heap"? All sizes are dynamic, so I can't malloc with a size in advance. – Superziyi Mar 02 '20 at 05:21
  • Can you lower the priority of the cleanup thread? – Phil Brubaker Mar 02 '20 at 09:00
  • @PhilBrubaker Unfortunately that won't be very easy to do. The main program uses the libuv event-based framework, and I'm forking another thread to do the deletion via `uv_queue_work`. – Superziyi Mar 02 '20 at 09:23
  • @Superziyi Rather than forking off the thread and deleting the map in one mass operation, perhaps you could delete X elements at a time and delay between each iteration. – Phil Brubaker Mar 02 '20 at 09:31
  • That's exactly how it's done for allocation, actually; it is throttled, inside the same thread as the request-matching thread. But for deletion, as you can see from the code, it's just a top-level `delete` call that takes care of calling destructors several levels deep. Not saying it can't be done, but it's not like reading a file a certain buffer size at a time; it's hard to track progress when I'm just calling destructors... – Superziyi Mar 02 '20 at 10:05
  • @Superziyi merely reading or writing to already-allocated heap memory won't be serialized, but e.g. thread A's call to `new` or `free()` could hold off thread B's call to `new` or `free`. – Jeremy Friesner Mar 02 '20 at 15:13
  • Use a profiler to find the bottleneck, e.g. `perf record -g -e cycles:ppp` and then `perf report` as a start. Or attach `perf record` when you destroy the old cache and then detach it. It is much faster and more accurate than soliciting guesses based on your description and no code. – Maxim Egorushkin Mar 03 '20 at 13:06
  • Did you try multiple hash maps instead of just one? – caot Mar 09 '20 at 02:55

5 Answers

5

It might be worthwhile to store just a single std::string for all your data combined, and use std::string_view in the map. This eliminates mutex contention, as there's only one memory allocation needed. string_view has a trivial destructor, so you don't need a thread for that.

I've successfully used this technique before to speed up a program by 2500%, but that was also because this technique reduced the total memory usage.
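A minimal sketch of the idea, with illustrative names; the key point is reserving the backing string up front so it never reallocates and the views never dangle:

#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <utility>
#include <vector>

struct Evaluator {
    std::string arena;                              // all key bytes, one allocation
    std::unordered_map<std::string_view, int> map;  // views into `arena`

    explicit Evaluator(const std::vector<std::pair<std::string, int>>& rows) {
        std::size_t total = 0;
        for (const auto& r : rows) total += r.first.size();
        arena.reserve(total);       // must never reallocate once views are taken
        map.reserve(rows.size());
        for (const auto& r : rows) {
            std::size_t off = arena.size();
            arena.append(r.first);
            map.emplace(std::string_view(arena).substr(off, r.first.size()), r.second);
        }
    }
    // Destruction still walks the map's nodes, but frees no per-key string
    // buffers; the arena is released in one shot.
};

Lookup is unchanged: map.find() accepts a std::string_view wrapped around the incoming request attribute, with no allocation.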

MSalters
  • Still, allocation is not the problem here. I edited the post to clarify that allocation is done incrementally in a throttled fashion. The apparent contention seems to be between one thread just accessing these strings in memory and another thread deleting other strings that were allocated in other parts of the heap. Could it be due to heavy cache misses: too many string destructions require fetching strings from RAM into cache, causing cache contention (loading the string to delete into the same cache line the request-matching thread tries to access for its string)? Does that sound likely? – Superziyi Mar 03 '20 at 06:43
  • @Superziyi If you allocate just one string, you also have to deallocate just one string. That might be faster. – VLL Mar 03 '20 at 09:24
  • @Superziyi: String destruction should not require touching the string content itself. But the hash map will have the strings scattered in memory, and you'll have many cache misses (of course: millions of strings won't fit in cache). Also, accessing string content will not require a heap mutex lock, but it will take cache. – MSalters Mar 03 '20 at 09:31
3

You can try using a std::vector for storing the strings. std::vector elements are stored contiguously, so it will reduce cache misses (see What is a "cache-friendly" code?).

So you would have a map<???, size_t> instead of a map<???, std::string>. You get one more indirection to reach your string (which means an extra run-time cost), but it allows you to iterate over all the strings with far fewer cache misses.
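A minimal sketch of this layout (an int key is used purely for illustration, since the question's real keys are the strings themselves):

#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct StringTable {
    std::vector<std::string> storage;          // string objects stored contiguously
    std::unordered_map<int, std::size_t> map;  // key -> index into `storage`

    void insert(int key, std::string value) {
        map[key] = storage.size();
        storage.push_back(std::move(value));
    }
    const std::string& lookup(int key) const {
        return storage.at(map.at(key));        // the extra indirection mentioned above
    }
};

Note that only the string objects themselves (and short strings, via SSO) sit contiguously; long strings still own separate heap buffers.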

Martin Morterol
  • Oh, I just realized I forgot to mention that my strings are stored as keys and used in lookups... so that's a bit of a bummer. Sorry I didn't make it clear at first. – Superziyi Feb 27 '20 at 10:11
3

It would be great if you could recreate the problem you are encountering with an MVCE and show it: you know, many times the problem you think is your problem... is not the problem.

How can I confirm that the above 2 memory issues are the cause (any tools/metrics)?

Given the information here, I would suggest using a profiler - gprof (compile with -g -pg) being the basic one. If you have the Intel compiler available, you can use VTune.

There is a free version of VTune, but I have personally used only the commercial version.

Besides this, you can insert timings in your code: from the textual description, it's not clear whether the time to populate the map is comparable to the time needed to erase it, or whether it grows significantly when run concurrently. I would start there. Note that the current version of malloc() is greatly optimized for concurrency too (is this Linux? - please add a tag to the question).

For sure, when you erase the map there are millions of free() calls made by std::string destructors - but you need to be sure whether this is the problem or not: you could use a better approach (many are mentioned in the answers/comments) or a custom allocator backed by a huge memory block that you create/destroy as a single unit.
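One standard way to get such a "single block" allocator without hand-rolling one is C++17's std::pmr, which a comment on the question already hints at. A sketch, not a drop-in for the question's exact types:

#include <memory_resource>
#include <set>
#include <string>
#include <unordered_map>

struct DataSet {
    // Every node, string buffer, and set element draws from this one arena.
    std::pmr::monotonic_buffer_resource arena;
    std::pmr::unordered_map<std::pmr::string, std::pmr::set<int>> map{&arena};
};
// Destroying a DataSet still visits the nodes, but each deallocation is a
// no-op on the monotonic resource; the arena then releases its big chunks
// wholesale instead of making millions of individual free() calls.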

If you provide an MVCE as a starting point, I or others will be able to provide a consistent answer (this is not an answer yet - but too long to be a comment).

Just to clarify, the program deliberately never allocates stuff and frees other stuff at the same time, and it only has 2 threads, one dedicated just to deletion.

Keep in mind that each string in the map needs one (or more) new and one delete (based on malloc() and free() respectively), whether the strings are in the keys or in the values.

What do you have in the "values" of the map?

Since you have a map<string, set<int>>, you have many allocations: every time you perform map[string].insert(val) with a new key, your code implicitly calls malloc() for both the string and the set. Even if the key is in the map already, a new int in the set requires a new node in the set to be allocated.
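To make the count concrete, a small example (the three-allocation count assumes the key exceeds the small-string-optimization buffer):

#include <map>
#include <set>
#include <string>

int main() {
    std::map<std::string, std::set<int>> m;

    // One logical insert, up to three heap allocations:
    // 1) a tree node for the new map entry,
    // 2) the key's character buffer (it exceeds the SSO limit),
    // 3) a set node holding 42.
    m["some-user-agent-string-well-past-the-sso-limit"].insert(42);
}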

So you have really many allocations while building the structure: your memory is very fragmented on one side, and your code seems really "malloc intensive", which could in principle starve the memory-allocation calls.

Multithreaded memory allocations/deallocations

One peculiarity of modern memory subsystems is that they are optimized for multi-core systems: when one thread allocates memory on one core, there is no global lock, but a thread-local or core-local lock for a thread-local pool.

This means that when one thread needs to free the memory allocated by another one, there is a non-local (slower) lock involved.

This means that the best approach is for each thread to allocate/deallocate its own memory. That said, while in principle you can optimize your code a lot with data structures that require fewer malloc/free interactions, your code will be more local, with respect to memory allocations, if you let each thread:

  • get one block of data
  • build the map<string, set<int>>
  • free it

And you have two threads that repeatedly perform this task.

NOTE: you need enough RAM to handle concurrent evaluators, but you are already using 2 of them concurrently, loaded with a double-buffering scheme (one filling, one cleaning). Are you sure your system is not swapping because of RAM exhaustion?

Furthermore, this approach is scalable: you can use as many threads as you want. In your approach you were limited to 2 threads - one building the structure, one destroying it.
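A sketch of that scheme (the block source and the hand-off to readers are deliberately left abstract):

#include <map>
#include <memory>
#include <set>
#include <string>

using DataSet = std::map<std::string, std::set<int>>;

// Each worker builds a data set and later destroys it itself, so both the
// malloc()s and the free()s hit that thread's core-local allocator pool.
void workerLoop(/* block source and hand-off omitted */) {
    for (;;) {
        auto data = std::make_unique<DataSet>();
        // ... parse one block of input and fill *data ...
        // ... let readers use *data for this generation ...
        data.reset();  // freed by the same thread that allocated it
        break;         // placeholder: one iteration per incoming block
    }
}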

Optimizing

Without an MVCE it's hard to give directions. Here are just some ideas; only you can know whether they can be applied now:

  • replace the set with a sorted vector, reserved at creation time (see the sketch after this list)
  • replace the map keys with a flat vector of equally spaced, sorted strings
  • store the string keys sequentially in a flat vector and add hashes to keep track of the keys of the map; add a hash map to keep track of the ordering of the strings in the vector
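A sketch of the first bullet, with a sorted vector standing in for std::set<int> (one contiguous allocation per key instead of one tree node per int):

#include <algorithm>
#include <cstddef>
#include <vector>

struct IntSet {
    std::vector<int> v;  // kept sorted and duplicate-free

    void reserve(std::size_t n) { v.reserve(n); }

    void insert(int x) {
        auto it = std::lower_bound(v.begin(), v.end(), x);
        if (it == v.end() || *it != x) v.insert(it, x);
    }

    bool contains(int x) const {
        return std::binary_search(v.begin(), v.end(), x);
    }
};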
Sigi
  • I have timings inserted; that's why I could describe my observations (FWIU that's what a profiler is for). I edited the post to reflect that populating the map isn't a concern, because that's done incrementally in a throttled fashion, on the same main thread where request matching happens. I just can't do the same throttling for deletion (because I'm not hand-rolling the data structures). I'm not allocating and freeing at the same time on the heap, but can accessing L3/RAM from 2 threads cause contention? The values are sets of integers, so `map<string, set<int>>`. I'll try to work on an MVCE. Thanks! – Superziyi Mar 03 '20 at 06:21
  • It's not clear whether the deletion process is occasionally slow even sequentially - which can happen because of heap reorganization/fragmentation - or whether it happens only when multithreaded. I've added some info to the answer with some more ideas. – Sigi Mar 03 '20 at 09:11
  • "accessing L3/RAM from 2 threads can cause contention?" - This would eventually be related to memory usage by different threads, not malloc()/free() – Sigi Mar 03 '20 at 09:23
0

So, thanks to all the answers and comments given. I wasn't able to pick a best answer, partly because the problem itself is vague and no single answer really covered everything. But I did learn a lot from these answers and hence upvoted most of them. Here's what I found after various experiments; the major issues are:

  1. Why a slow operation on the deletion thread affects the other one. Given that the program doesn't malloc/dealloc concurrently on both threads, there should not be any heap contention, and neither general CPU nor available memory is the bottleneck, so the only plausible explanation left is memory-bandwidth exhaustion. I found an answer to another post which says it's generally possible for a single core to saturate the memory bus if memory access is all it does. All my deletion thread does is traverse a giant map and delete each element in it, so it's conceivable that it saturates the memory bus, and the other thread, which is doing both memory access and other computation, drastically slows down. From here on I'll focus on the various reasons this deletion can be slow.

  2. The map is giant, with millions of elements and hundreds of megabytes in size. Deleting every one of them requires accessing them first, and clearly very few can fit in the L1/L2/L3 caches. So there are a ton of cache misses and fetches from RAM.

  3. As a couple of answers/comments mentioned, I store std::string objects in the map. Each is allocated its own space and has to be fetched and deleted one by one. The advice from MSalters improves performance considerably: store string_view in the map, while storing the actual byte content of each string in a pre-allocated contiguous memory block. Now the deletion of a million objects in the map becomes the almost trivial destruction of string_view objects, which are merely pointers, and the destruction of all string contents is the destruction of that pre-allocated block.

  4. I didn't mention that in some other parts of the program I also store other C++ objects in other maps. They're likewise problematic. Similar "flattening" of such C++ objects is necessary, albeit harder to do without ready-made classes like string_view. The idea is to store as many primitive types and pointers as we can, and put all the content (most of it can be boiled down to strings) in contiguous byte buffers. Making everything trivial to destroy is the goal.

  5. Lastly, it turns out the map container itself can be pretty costly to destroy, particularly when it's big. For node-based std containers, traversing and deleting each node handle takes time. What I found is that alternative implementations of truly flat hash maps make the deletion a lot faster. Examples of such maps include Abseil's flat_hash_map and this blogger's flat_hash_map. Note that they're both true hash maps even though they're flat. Boost's flat_map can also be deleted very fast, but it's not a real hash map; it's backed by a strictly ordered vector, which makes insertion (when my input is not ordered) extremely slow. (A small sketch combining points 3 and 5 follows this list.)
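A small sketch of points 3 and 5 combined; the arena owning the key bytes must outlive the map:

#include <string_view>
#include <vector>

#include "absl/container/flat_hash_map.h"

// string_view keys point into a separately owned arena; the table keeps its
// slots in one flat array. Teardown destroys the std::vector values and then
// releases that array - no per-node walk as with std::unordered_map.
using FlatEvaluatorMap = absl::flat_hash_map<std::string_view, std::vector<int>>;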

Superziyi
-1

This will be a lengthy answer because your question is very complicated.

Read procedure

When you read something, you start to allocate memory in your app. That is fine in the normal case when you do not need performance, but that is where the problems begin.

STL maps are red-black trees, so they have a lot of pointers, which means each element was allocated individually. This creates a situation where your memory space is very fragmented and it is difficult for the system to deallocate the elements efficiently. Reason: the system has to follow the pointers.

The appropriate container

STL map explained: Why is std::map implemented as a red-black tree?

Here is a basic discussion about the memory-management behavior of the map: https://bytes.com/topic/c/answers/763319-stl-map-memory-management

According to your description, you read a massive file that you then sequentially stream to someone. My question here is: can this data be stored as STL pairs in contiguous memory, since you say you have to stream it?

Do you have to search for elements in there? If yes, then you should find out how often and at what frequency; the answer will tell you whether the STL map is a good container, since it is efficient for search activities.

Now, in this link there are some benchmarks about pointer-referenced containers and contiguous containers: https://baptiste-wicht.com/posts/2012/12/cpp-benchmark-vector-list-deque.html

The idea is that you use the appropriate container, so you get the right memory-management behavior.

Is there any advantage of using map over unordered_map in case of trivial keys? Here is an alternative to your map that could be a cheap fast hack until you develop a more precise solution.

Memory management

My question for your problem is: can you clear and reuse your containers? Freeing containers is an expensive business.

You could use a ring buffer of STL maps where one is being read -> one is ready -> one is being written. That would be very efficient and could give you an edge, since you would not have to free any buffers, just clear them after usage.
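A sketch of that ring; note that unordered_map::clear() still destroys elements node by node, so what this mainly saves is re-allocating and re-hashing the bucket array on each refill:

#include <array>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>

using DataMap = std::unordered_map<std::string, std::set<int>>;

struct MapRing {
    std::array<DataMap, 3> maps;
    std::size_t read = 0, ready = 1, write = 2;  // roles rotate; the maps stay put

    DataMap& readMap()  { return maps[read];  }
    DataMap& writeMap() { return maps[write]; }

    void rotate() {
        std::size_t oldRead = read;
        read  = ready;         // the freshly built map becomes readable
        ready = write;
        write = oldRead;
        maps[write].clear();   // reuse its buckets rather than destroy the map
    }
};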

Edit: here is an answer about the memory fragmentation that happens during frequent deletes in a container: What is memory fragmentation?

Your problem is that you use strings; they can extend memory, but underneath they are mallocs of char. Now, I would not delete stuff but flag it unused or something similar.

One tiny thing that might help: use the string reserve function when you create your strings. Then you can say 128, which means 128 bytes; it will eat a bit of memory, but it will make fragmentation handling easier and the reallocation behavior of the string less problematic.

Now, this might also be totally useless. You need to profile your app to see what is going on; the best way is perf and flame graphs if you are on Linux.

Marko Bencik
  • Thank you! Sorry if I didn't make it clear: I meant "streaming" in the sense of an alternative solution to the bulk dataset update (allocate new and destroy old in memory); I've edited that. My use case for these maps is really just fast lookup, which is also why an unordered_map is used as the container storing millions of possible strings. The problem with reusing this memory is that it would need to be allocated contiguously (but each string object is allocated separately), so I would need to do manual mallocs and know the sizes up front. I can't use a vector, because the strings are keys that have to be looked up. – Superziyi Mar 02 '20 at 03:50
  • And yeah, all data needs to be present to ensure accuracy, so I can't use a ring buffer. It's a pretty standard hash map use case. – Superziyi Mar 02 '20 at 04:37