
I have a very large immutable set of keys that doesn't fit in memory, and an even larger list of references, which must be scanned just once. How can the mark phase (of garbage collection) be done in RAM? I do have a possible solution, which I will write as an answer later (I don't want to spoil it), but maybe there are other solutions I didn't think of.

I will try to restate the problem to make it more "real":

You work at Facebook, and your task is to find which users didn't ever create a post with an emoji. All you have is the list of active user names (around 2 billion), and the list of posts (user name / text), which you have to scan, but just once. It contains only active users (you don't need to validate them).

Also, you have one computer with 2 GB of RAM (bonus points for 1 GB). So it has to be done all in RAM (without external sort or reading in sorted order), within two days.

Can you do it? How? Tips: You might want to use a hash table, with the user name as the key, and one bit as the value. But the list of user names doesn't fit in memory, so that doesn't work. With user ids it might work, but you just have the names. You can scan the list of user names a few times (maybe 40 times, but not more).

Thomas Mueller
  • A Bloom filter comes to mind, but you're going to get some false positives. With a gigabyte to play with, and only 2 billion items, the false positive rate will be very small, but not zero. – Jim Mischel Jun 17 '17 at 13:48
  • I didn't think about that! But I guess the memory usage would be too large: with 1% false positives, it would require about 10 bits per key I think. – Thomas Mueller Jun 19 '17 at 08:44
  • A Bloom filter to handle 2 billion items with a false positive rate of 1% would require about 2.35 gigabytes ( 1.18 bits per key), and 7 hash functions. See https://hur.st/bloomfilter?n=2000000000&p=0.01, for example. – Jim Mischel Jun 19 '17 at 13:01
  • @JimMischel yes, this is 9.58 bits per key (19,170,116,755 bits in the filter, 2 billion keys), and 1.18 _bytes_ per key. – Thomas Mueller Jun 19 '17 at 15:17
  • Yeah, you're right. My brain misfired on that one, didn't it? I wonder if there's a more memory efficient implementation. – Jim Mischel Jun 19 '17 at 15:44
  • Now here is a contradiction: `can read it about 100 times` or `maybe ten times, but not more`. – Evgeny Kluev Jun 21 '17 at 10:34
  • @EvgenyKluev I'm sorry, you are right. 10 times makes it quite complex; using a limit of 40 times should make it simpler. I have updated the question. – Thomas Mueller Jun 21 '17 at 10:47

4 Answers


Sounds like a problem I tackled 10 years ago.

The first stage: ditch GC. The overhead of GC for small objects (a few bytes) can be in excess of 100%.

The second stage: design a decent compression scheme for user names. English text needs about 3 bits per character. Even if you allow more characters, the average number of bits per character won't rise much.

Third stage: create a dictionary of usernames in memory. Use a 16-bit prefix of each username to choose the right sub-dictionary. Read in all usernames, initially sorting them just by this prefix, then sort each sub-dictionary in turn. As noted in the question, allocate one extra bit per username for the "used emoji" result.

The problem is now I/O bound, as the computation is embarrassingly parallel. The longest phase will be reading in all the posts (which is going to be many TB).

Note that in this setup, you're not using fancy data types like String. The dictionaries are contiguous memory blocks.
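
To make the layout concrete, here is a minimal Java sketch of such a prefix-bucketed dictionary. It is not MSalters' code: the 5-bits-per-letter packing (limited to 12 lowercase characters) stands in for the ~3 bit/character compression scheme, and the bucket/marker structures are illustrative only.

import java.util.*;

// Toy version of the prefix-bucketed dictionary: each name is packed into a
// 60-bit value, the top 16 bits select the sub-dictionary (and are not stored),
// and the remaining 44 bits go into a sorted long[] plus one marker bit each.
public class PrefixDictSketch {
    static final int BUCKETS = 1 << 16;

    // Pack up to 12 lowercase letters at 5 bits each; longer names are
    // truncated here, which a real compressor would of course not do.
    static long pack(String name) {
        long v = 0;
        for (int i = 0; i < 12; i++) {
            v = (v << 5) | (i < name.length() ? name.charAt(i) - 'a' + 1 : 0);
        }
        return v;
    }

    final long[][] buckets = new long[BUCKETS][];   // sorted 44-bit remainders
    final BitSet[] marks = new BitSet[BUCKETS];     // one "used emoji" bit per name

    PrefixDictSketch(List<String> names) {
        List<List<Long>> tmp = new ArrayList<>();
        for (int i = 0; i < BUCKETS; i++) tmp.add(new ArrayList<>());
        for (String name : names) {
            long p = pack(name);
            tmp.get((int) (p >>> 44)).add(p & ((1L << 44) - 1));
        }
        for (int i = 0; i < BUCKETS; i++) {
            buckets[i] = tmp.get(i).stream().mapToLong(Long::longValue).sorted().toArray();
            marks[i] = new BitSet(buckets[i].length);
        }
    }

    // Mark a user who posted an emoji: pick the bucket, binary-search the rest.
    void mark(String name) {
        long p = pack(name);
        int bucket = (int) (p >>> 44);
        int pos = Arrays.binarySearch(buckets[bucket], p & ((1L << 44) - 1));
        if (pos >= 0) marks[bucket].set(pos);
    }
}

The point is that each sub-dictionary is a plain array of primitives: a few bytes per name plus one marker bit, with no per-object or pointer overhead.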

Given a deadline of two days, however, I would dump some of this fanciness. The I/O bound for reading the text is severe enough that the creation of the user database may exceed 16 GB. Yes, that will swap to disk. Big deal for a one-off.

MSalters
  • Thanks, this is interesting! Can you do it with just 1 GB of RAM, that means with just 4 bits per key? By the way, I don't actually do GC for small objects. But I do know a use case to do GC with files (Jackrabbit FileDataStore). – Thomas Mueller Jun 16 '17 at 07:23
  • @ThomasMueller: It rapidly becomes counter-productive. But note that I don't need to store the first 16 bits of the compressed user name in the sub-dictionaries themselves. Those 16 bits are simply not stored (!) I suspect you're thinking of a trie, but the problem is that pointers are bloody big things. A 4 bit key with a 64 bit pointer isn't saving a lot. – MSalters Jun 16 '17 at 07:34
  • I'm thinking about a [minimum perfect hash table algorithm I wrote](https://github.com/thomasmueller/minperf). Now I want to write an article about it, and need to have a useful and watertight use case. I see now that 16 GB of RAM was too much freedom :-) – Thomas Mueller Jun 16 '17 at 07:41

TL;DR: Hash the keys, sort the hashes, and store the sorted hashes in compressed form.

The algorithm I propose may be considered an extension of the solution to a similar (simpler) problem.

  1. To each key, apply a hash function that maps keys to integers in the range [0..h]. Starting with h = 2 * number_of_keys seems to be reasonably good.
  2. Fill all available memory with these hashes.
  3. Sort the hashes.
  4. If a hash value is unique, write it to the list of unique hashes; otherwise remove all copies of it and write it to the list of duplicates. Keep both lists in compressed form: as differences between adjacent values, compressed with an optimal entropy coder (an arithmetic coder, range coder, or ANS coder). If the list of unique hashes was not empty, merge it with the sorted hashes; additional duplicates may be found while merging. If the list of duplicates was not empty, merge the new duplicates into it. (A small sketch of steps 1..4 follows the list.)
  5. Repeat steps 1..4 while there are any unprocessed keys.
  6. Read the keys several more times, performing steps 1..5 each time, but ignore all keys that are not in the list of duplicates from the previous pass. Use a different hash function for each pass (for everything except matching against the list of duplicates from the previous pass, which means the hashes have to be sorted twice, for 2 different hash functions).
  7. Read the keys again to convert the remaining list of duplicate hashes into a list of plain keys. Sort it.
  8. Allocate array of 2 billion bits.
  9. Use all unoccupied memory to construct an index for each compressed list of hashes. This could be a trie or a sorted list. Each index entry should contain a "state" of the entropy decoder, so that lookups do not have to decode the compressed stream from the very beginning.
  10. Process the list of posts and update the array of 2 billion bits.
  11. Read the keys once more to convert the hashes back to keys.
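
A minimal Java sketch of steps 1..4 on toy data: a simple FNV-style hash stands in for the per-pass hash functions, and a plain varint gap code stands in for the ANS/range coder, so the sizes it prints are only indicative.

import java.io.ByteArrayOutputStream;
import java.util.*;

// Steps 1..4 in miniature: hash the keys into [0..2n), sort, separate unique
// hashes from duplicated ones, and store the sorted unique hashes as gaps
// between adjacent values.
public class HashPassSketch {

    // FNV-style mix; a real solution would use a stronger hash per pass
    static int hash(String key, int seed, int range) {
        int h = 0x811c9dc5 ^ seed;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 0x01000193;
        }
        return Math.floorMod(h, range);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alice", "bob", "carol", "dave", "eve");
        int range = 2 * keys.size();                    // h = 2 * number_of_keys

        int[] hashes = keys.stream().mapToInt(k -> hash(k, 1, range)).sorted().toArray();

        List<Integer> unique = new ArrayList<>();
        List<Integer> duplicates = new ArrayList<>();   // to be re-hashed in the next pass
        for (int i = 0; i < hashes.length; ) {
            int j = i;
            while (j < hashes.length && hashes[j] == hashes[i]) j++;
            (j - i == 1 ? unique : duplicates).add(hashes[i]);
            i = j;
        }

        // Gap-encode the sorted unique hashes (stand-in for entropy coding).
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        int prev = -1;
        for (int value : unique) {
            int gap = value - prev;                     // >= 1: values are distinct and sorted
            while (gap >= 0x80) { compressed.write((gap & 0x7f) | 0x80); gap >>>= 7; }
            compressed.write(gap);
            prev = value;
        }
        System.out.println(unique.size() + " unique hashes, " + duplicates.size()
                + " duplicated, " + compressed.size() + " bytes of gap-coded uniques");
    }
}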

While the value h = 2*number_of_keys seems to be reasonably good, we could vary it to optimize the space requirements (setting it too high decreases the compression ratio, setting it too low results in too many duplicates).

This approach does not guarantee the result: it is possible to invent 10 bad hash functions so that every key is duplicated on every pass. But with high probability it will succeed, and most likely it will need about 1 GB of RAM (because most compressed integer values are in the range [1..8], so each key contributes about 2..3 bits to the compressed stream).

To estimate the space requirements precisely we might use either a (complicated?) mathematical proof or a complete implementation of the algorithm (also pretty complicated). But to obtain a rough estimate we could use a partial implementation of steps 1..4. See it on Ideone. It uses a variant of the ANS coder named FSE (taken from https://github.com/Cyan4973/FiniteStateEntropy) and a simple hash function implementation (taken from https://gist.github.com/badboy/6267743). Here are the results:

Key list loads allowed:     10           20
Optimal h/n:                 2.1          1.2
Bits per key:                2.98         2.62
Compressed MB:             710.851      625.096
Uncompressed MB:            40.474        3.325
Bitmap MB:                 238.419      238.419
MB used:                   989.744      866.839
Index entries:           1'122'520    5'149'840
Indexed fragment size:    1781.71       388.361

With the original limitation of 10 key scans, the optimal value for the hash range is only slightly higher (2.1) than my guess (2.0), and this parameter is very convenient because it allows using 32-bit hashes (instead of 64-bit ones). The required memory is slightly less than 1 GB, which allows using pretty large indexes (so step 10 would not be very slow). There is a little problem here: these results show how much memory is consumed at the end, but in this particular case (10 key scans) we temporarily need more than 1 GB of memory while performing the second pass. This may be fixed by dropping the results (unique hashes) of the first pass and recomputing them later, together with step 7.

With the less tight limitation of 20 key scans, the optimal value for the hash range is 1.2, which means the algorithm needs much less memory and leaves more space for indexes (so that step 10 would be almost 5 times faster).

Loosening the limitation to 40 key scans does not result in any further improvement.

Evgeny Kluev
  • This is close to the answer I have in mind. I think it will use a bit too much RAM, but without having a real implementation this is hard to prove. – Thomas Mueller Jun 20 '17 at 18:59
  • I have a question: as far as I understand, steps 1..6 will result in at least 2 (probably more) compressed lists. Right? The total number of entries is 2 billion. Then at step 10, how do you know which of those lists to use? – Thomas Mueller Jun 21 '17 at 06:48
  • Actually (if it is allowed to read keys 10 times) there would be exactly 9 lists: 8 with hashes and 1 with plain keys. Each next list would be less than half the size of previous one (and each one produced using different hash function). At step 10 we have no idea which list contains given key, so we should use every list, starting with the largest one (but we could stop when the key is found, so on average less than two lists would be searched). – Evgeny Kluev Jun 21 '17 at 09:06
  • Thanks! I will try to run the numbers for this solution. I have updated the number of times you can load the list of users, the original 10 was a bit strict and 100 is a bit high. – Thomas Mueller Jun 21 '17 at 10:56
  • So this could work! With 1.44*number_of_keys, the probability of the hash being unique is 50%. That way you can use one bit per key instead of arithmetic coding; this needs 2/ln(2) bits/key (uncompressed). Generation is a bit tricky, and lookup is a bit slow (having to count bits), but should be OK with an index. – Thomas Mueller Jun 21 '17 at 19:14
  • You are right, optimal coding is not necessary. Not one bit per key but one bit per hash value or 1.44 bits per key and the same amount to recursively compress duplicates, so 2.88. Optimal coding needs only 2.7, but your approach is much simpler. Less remaining memory for indexes is not a big problem because you'll need much smaller index element. And you will not need more than 20 passes as well. – Evgeny Kluev Jun 21 '17 at 19:57
  • Your algorithm almost matches [BBHash](https://github.com/rizkg/BBHash). Even the 1.44 (ln(2)^-1) is used in the paper at "Space Usage During Construction". Disk usage is discussed there as well. – Thomas Mueller Jun 23 '17 at 12:01

Minimal perfect hashing

Create a minimal perfect hash function (MPHF). At around 1.8 bits per key (using the RecSplit algorithm), this uses about 429 MB. (Here, 1 MB is 2^20 bytes, 1 GB is 2^30 bytes.) For each user, allocate one bit as a marker, about 238 MB. So memory usage is around 667 MB. Then read the posts, for each user calculate the hash, and set the related bit if needed. Read the user table again, calculate the hash, check if the bit is set.
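
Assuming a Java MPHF library with an evaluate(key) -> index operation (the Mphf interface below is a placeholder, not the actual minperf or RecSplit API), the marking and the final check would look roughly like this; the emoji test is deliberately simplistic.

import java.util.BitSet;
import java.util.stream.Stream;

// Sketch of how the MPHF and the marker bits are used. evaluate(key) is
// assumed to return a distinct index in [0, userCount) for every user name.
interface Mphf {
    long evaluate(String userName);
}

class EmojiMarker {
    private final Mphf mphf;
    private final BitSet usedEmoji;          // one bit per user, ~238 MB for 2 billion users

    EmojiMarker(Mphf mphf, int userCount) {  // 2 billion still fits an int index
        this.mphf = mphf;
        this.usedEmoji = new BitSet(userCount);
    }

    // Mark phase: scan all posts once; each post is {userName, text}.
    void scanPosts(Stream<String[]> posts) {
        posts.filter(p -> containsEmoji(p[1]))
             .forEach(p -> usedEmoji.set((int) mphf.evaluate(p[0])));
    }

    // Final pass over the user table: names whose bit was never set.
    Stream<String> usersWithoutEmoji(Stream<String> userNames) {
        return userNames.filter(u -> !usedEmoji.get((int) mphf.evaluate(u)));
    }

    private static boolean containsEmoji(String text) {
        // deliberately simplistic: any code point above the basic plane
        return text.codePoints().anyMatch(cp -> cp > 0xFFFF);
    }
}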

Generation

Generating the MPHF is a bit tricky, not because it is slow (this may take around 30 minutes of CPU time), but due to memory usage. With 1 GB of RAM, it needs to be done in segments. Let's say we use 32 segments of about the same size, as follows (a rough code skeleton of this loop is shown below the list):

  • Loop segmentId from 0 to 31.
  • For each user, calculate the hash code, modulo 32 (or bitwise and 31).
  • If this doesn't match the current segmentId, ignore it.
  • Calculate a 64 bit hash code (using a second hash function), and add that to the list.
  • Do this until all users are read.
  • A segment will contain about 62.5 million keys (2 billion divided by 32); with 64-bit hashes, that is about 477 MB.
  • Sort this list by key (in place) to detect duplicates. With 64 bit entries, the probability of duplicates is very low, but if there are any, use a different hash function and try again (you need to store which hash function was used).
  • Now calculate the MPHF for this segment. The RecSplit algorithm is the fastest I know. The CHD algorithm can be used as well, but needs more space / is slower to generate.
  • Repeat until all segments are processed.

The above algorithm reads the user list 32 times. This could be reduced to about 10 if more segments are used (for example one million), and as many segments are read per step as fit in memory. With smaller segments, fewer bits per key are needed due to the reduced probability of duplicates within one segment.
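
A rough Java skeleton of the 32-segment loop; hash1 and hash2 are simple stand-ins for real hash functions, and buildMphfForSegment is a placeholder for the actual RecSplit/CHD construction, so none of this is the real builder API.

import java.util.Arrays;

// Rough skeleton of the segmented MPHF construction (32 passes over the
// user list). hash1 picks the segment, hash2 is the 64-bit per-key hash.
class SegmentedBuildSketch {
    static final int SEGMENTS = 32;

    static void build(Iterable<String> userList) {
        for (int segmentId = 0; segmentId < SEGMENTS; segmentId++) {
            // sized with some slack over the expected ~62.5 million keys
            long[] hashes = new long[70_000_000];
            int n = 0;
            for (String user : userList) {              // one full scan per segment
                if ((hash1(user) & (SEGMENTS - 1)) != segmentId) continue;
                hashes[n++] = hash2(user);
            }
            Arrays.sort(hashes, 0, n);
            for (int i = 1; i < n; i++) {
                if (hashes[i] == hashes[i - 1]) {
                    // rare with 64-bit hashes; a real implementation would retry
                    // the segment with a different hash2 and record which one was used
                    throw new IllegalStateException("duplicate hash in segment " + segmentId);
                }
            }
            buildMphfForSegment(segmentId, Arrays.copyOf(hashes, n));
        }
    }

    static long hash1(String s) {
        return s.hashCode() & 0xffffffffL;
    }

    static long hash2(String s) {                       // FNV-1a, 64 bit
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h = (h ^ s.charAt(i)) * 0x100000001b3L;
        }
        return h;
    }

    static void buildMphfForSegment(int segmentId, long[] keys) {
        // placeholder: hand the de-duplicated 64-bit keys to the MPHF builder
    }
}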

Thomas Mueller

The simplest solution I can think of is an old-fashioned batch update program. It takes a few steps, but in concept it's no more complicated than merging two lists that are in memory. This is the kind of thing we did decades ago in bank data processing.

  1. Sort the file of user names by name. You can do this easily enough with the GNU sort utility, or any other program that will sort files larger than what will fit in memory.
  2. Write a query to return the posts, in order by user name. I would hope that there's a way to get these as a stream.
  3. Now you have two streams, both in alphabetic order by user name. All you have to do is a simple merge:

Here's the general idea:

currentUser = get first user name from users file
currentPost = get first post from database stream
usedEmoji = false
while (not at end of users file and not at end of database stream)
{
    if currentUser == currentPostUser
    {
        if currentPost has emoji
        {
            usedEmoji = true
        }
        currentPost = get next post from database
    }
    else if currentUser > currentPostUser
    {
        // No user for this post. Get next post.
        currentPost = get next post from database
        usedEmoji = false
    }
    else
    {
        // Current user is less than post user name.
        // So we have to switch users.
        if (usedEmoji == false)
        {
            // No post by this user contained an emoji
            output currentUser name
        }
        currentUser = get next user name from file
        usedEmoji = false   // reset the flag for the new user
    }
}
// at the end of one of the files.
// Clean up.

// if we reached the end of the posts, but there are still users left,
// then output each user name.
// The usedEmoji test is in there strictly for the first time through,
// because the current user when the above loop ended might have had
// a post with an emoji.
while not at end of user file
{
    if (usedEmoji == false)
    {
        output currentUser name
    }
    currentUser = get next user name from file
    usedEmoji = false
}

// at this point, names of all the users who haven't
// used an emoji in a post have been written to the output.

An alternative implementation, if obtaining the list of posts as described in #2 is overly burdensome, would be to scan the list of posts in their natural order and output the user name from any post that contains an emoji. Then sort the resulting file and remove duplicates. You can then proceed with a merge similar to the one described above, but you don't have to explicitly check whether a post has an emoji: if a name appears in both files, you simply don't output it.
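
A small Java sketch of that final merge, assuming both inputs are text files already sorted by user name, the emoji-user file has been de-duplicated, and the readers/writer are already open (the emoji pre-filtering step is left out):

import java.io.*;

// Set-difference merge: output every user that does not appear in emojiUsers.
class SortedDifference {
    static void output(BufferedReader users, BufferedReader emojiUsers, PrintWriter out)
            throws IOException {
        String user = users.readLine();
        String emojiUser = emojiUsers.readLine();
        while (user != null) {
            int cmp = (emojiUser == null) ? -1 : user.compareTo(emojiUser);
            if (cmp < 0) {             // name only in the user list: never used an emoji
                out.println(user);
                user = users.readLine();
            } else if (cmp == 0) {     // name in both files: skip it
                user = users.readLine();
                emojiUser = emojiUsers.readLine();
            } else {                   // advance the emoji-user side
                emojiUser = emojiUsers.readLine();
            }
        }
    }
}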

Jim Mischel
  • Both solutions write to disk, which is not necessary (not "allowed" in the original problem description: all in RAM). The alternative solution you suggest is exactly what we do for Apache Jackrabbit DataStore garbage collection, it works very well. – Thomas Mueller Jun 17 '17 at 07:21