
I have a file of ~50 million strings that I need to add to a symbol table of some sort on startup, then search several times with reasonable speed.

I tried using a DLB trie, since lookup would be relatively fast given that all strings are < 10 characters, but while populating the DLB I would get either a "GC overhead limit exceeded" error or an OutOfMemoryError (Java heap space). The same errors occurred with a HashMap. This is for an assignment that would be compiled and run by a grader, so I would rather not just allocate more heap space. Is there a different data structure that would have less memory usage, while still having reasonable lookup time?

3802400
  • A trie should be very efficient if your strings share many prefixes. What kind of strings are they? Are you sure about your trie implementation? – Jean-Baptiste Yunès Sep 26 '16 at 18:37
  • IRL, the best answer would probably be: http://en.wikipedia.org/wiki/BigTable – ControlAltDel Sep 26 '16 at 18:40
  • @Jean-BaptisteYunès It's a list of all possible passwords, so there's not much of a pattern to the prefixes... maybe this makes a trie the wrong choice? The trie implementation worked well with a smaller dataset. – 3802400 Sep 26 '16 at 18:45
  • @3802400 surely you don't mean _all_ possible passwords? If that was the case, and if passwords could be formed from a string of up to 9 letters (as you said <10), then 26 to the power of 9 is _much_ more than 50 million! And if you allow all characters, then it's a much bigger number than that. – Klitos Kyriacou Sep 26 '16 at 20:59
  • @KlitosKyriacou it is all possible passwords with several constraints. It was made easier for the sake of the assignment than real-life password constraints would be – 3802400 Sep 27 '16 at 01:32
  • If the list was built by deriving it from a basic password list, then only store the basics and generate the derivations for each one with an iterator at each leaf of your trie. What is your main constraint? Do you really need to have all passwords in memory at the same time? Would some kind of cache help you (generate passwords on the fly and push them into a cache for frequent access)? etc. I'm pretty sure that your "constraints" can guide how to optimize the storage; could you tell us more? – Jean-Baptiste Yunès Sep 27 '16 at 05:48
  • I agree with Jean-Baptiste. Let us know what the constraints are. It may be possible to use the constraints to store the passwords more efficiently. Also, if the passwords are mostly ASCII characters, then you can almost halve the space used by using UTF-8, as a byte array. Another question is, if the passwords are the keys in the map, what are the values and how much space do they take? – Klitos Kyriacou Sep 27 '16 at 06:26
  • @Jean-BaptisteYunès the passwords are all the same length, contain 1-3 numbers, 1-3 letters, and 1-2 symbols. They are not case sensitive. And they can't contain dictionary words. – 3802400 Sep 28 '16 at 00:55
  • @Klitos Kyriacou the values are times it took the passwords to be generated. The user needs to be able to search a password to get the time. – 3802400 Sep 28 '16 at 00:55
  • The valid characters of your passwords are so restricted that you can put two characters into each byte. (Each character can be coded in 4 bits.) This means that your passwords can fit inside a `long`. You can therefore use some kind of sparse array. Alternatively, you can use a kind of hashmap where you can guarantee that each hash is unique. – Klitos Kyriacou Sep 28 '16 at 06:32
  • Looking at your password spec again, it implies that your passwords are all 6 characters long (e.g. 1 number + 3 letters + 2 symbols, or 2+3+1, or 3+1+2). This means they can all fit into an `int` with just 3 bytes of it used; i.e. with a value ranging from 0 to 2^24-1. – Klitos Kyriacou Sep 28 '16 at 06:53
  • If the values (i.e. the times it took to generate each password) are in the range 0-255, then you can put the passwords into the top 3 bytes of an `int` and the times into the bottom byte. Then just put each `int` containing password+time into an `int[]` array. You can then use binary search to find the int with the top 3 bytes being the password you want, and find its time in the bottom byte. If the values are longer, then you can use a `long[]` array instead and do the same thing. – Klitos Kyriacou Sep 28 '16 at 07:12 (a rough sketch of this packing approach follows the comments)
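
A minimal sketch of the int-packing idea from the comments above, assuming the comment's premises hold: each password is exactly 6 characters drawn from an alphabet of at most 16 symbols (so 4 bits per character), and each generation time fits in one byte. The ALPHABET constant and the encode/lookupTime names are illustrative, not part of the question.

import java.util.Arrays;

public class PackedPasswordTable {
    // Hypothetical 16-symbol alphabet; the real assignment's character set may differ.
    private static final String ALPHABET = "abcdefghij!@#456";

    // Pack a 6-character password into the top 24 bits of an int, 4 bits per character,
    // leaving the low byte free for the generation time (0-255).
    static int encode(String password) {
        int bits = 0;
        for (int i = 0; i < 6; i++)
            bits = (bits << 4) | ALPHABET.indexOf(password.charAt(i));
        return bits << 8;
    }

    // entries must be sorted with Arrays.sort(); each entry is encode(password) | time.
    static int lookupTime(int[] entries, String password) {
        int key = encode(password);                          // time bits are zero in the key
        int idx = Arrays.binarySearch(entries, key);
        if (idx < 0) idx = -idx - 1;                         // first entry >= key, if any
        if (idx < entries.length && (entries[idx] & ~0xFF) == key)
            return entries[idx] & 0xFF;                      // low byte holds the time
        return -1;                                           // password not in the table
    }
}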

2 Answers


If you expect low prefix sharing, then a trie may not be your best option.

Since you only load the lookup table once, at startup, and your goal is low memory footprint with "reasonable speed" for lookup, your best option is likely a sorted array and binary search for lookup.

First, you load the data into an array. Since you likely don't know the size up front, you load into an ArrayList. You then extract the final array from the list.

Assuming you load 50 million 10 character strings, memory will be:

10 character string:
    String: 12 byte header + 4 byte 'hash' + 4 byte 'value' ref = 24 bytes (aligned)
    char[]: 12 byte header + 4 byte 'length' + 10 * 2 byte 'char' = 40 bytes (aligned)
    Total: 24 + 40 = 64 bytes
Array of 50 million 10 character strings:
    String[]: 12 byte header + 4 byte 'length' + 50,000,000 * 4 byte 'String' ref = 200,000,016 bytes
    Values: 50,000,000 * 64 bytes = 3,200,000,000 bytes
    Total: 200,000,016 + 3,200,000,000 = 3,400,000,016 bytes = 3.2 GB

You will need another copy of the String[] when you convert the ArrayList<String> to String[]. The Arrays.sort() operation may need 50% array size (~100,000,000 bytes) for temporary storage, but if ArrayList is released for GC before you sort, that space can be reused.

So, total requirement is ~3.5 GB, just for the symbol table.

Now, if space is truly at a premium, you can squeeze that. As you can see, the String itself adds an overhead of 24 bytes, out of the 64 bytes. You can make the symbol table use char[] directly.

Also, if your strings are all US-ASCII or ISO-8859-1, you can convert the char[] to a byte[], saving half the bytes.

Combined, that reduces the value size from 64 bytes to 32 bytes, and the total symbol table size from 3.2 GB to 1.8 GB, or roughly 2 GB during loading.


UPDATE

Assuming the input list of strings is already sorted, below is an example of how to do this. As an MCVE, it just uses a small static array as input, but you can easily read the strings from a file instead.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Test {
    public static void main(String[] args) {
        String[] wordsFromFile = { "appear", "attack", "cellar", "copper",
                                   "erratic", "grotesque", "guitar", "guttural",
                                   "kittens", "mean", "suit", "trick" };
        List<byte[]> wordList = new ArrayList<>();
        for (String word : wordsFromFile) // Simulating read from file
            wordList.add(word.getBytes(StandardCharsets.US_ASCII));
        // Compact symbol table: one byte[] per word, in the original sorted order
        byte[][] symbolTable = wordList.toArray(new byte[wordList.size()][]);

        test(symbolTable, "abc");
        test(symbolTable, "attack");
        test(symbolTable, "car");
        test(symbolTable, "kittens");
        test(symbolTable, "xyz");
    }
    private static void test(byte[][] symbolTable, String word) {
        int idx = Arrays.binarySearch(symbolTable,
                                      word.getBytes(StandardCharsets.US_ASCII),
                                      Test::compare);
        if (idx < 0)
            System.out.println("Not found: " + word);
        else
            System.out.println("Found    : " + word);
    }
    // Lexicographic comparison of byte arrays; signed byte compare is fine because US-ASCII bytes are non-negative
    private static int compare(byte[] w1, byte[] w2) {
        for (int i = 0, cmp; i < w1.length && i < w2.length; i++)
            if ((cmp = Byte.compare(w1[i], w2[i])) != 0)
                return cmp;
        return Integer.compare(w1.length, w2.length);
    }
}

Output

Not found: abc
Found    : attack
Not found: car
Found    : kittens
Not found: xyz
Andreas
  • I ran out of memory making an ArrayList, so I calculated the input size and tried to make a string array, but that also ran out of memory. I was able to make a byte array of the strings (which were already in sorted order) but I am having trouble thinking of how I would be able to search the byte array. – 3802400 Sep 27 '16 at 02:21

Use a single char array to store all strings (sorted), and an array of integers for the offsets. String n is the chars from offset[n - 1] (inclusive) to offset[n] (exclusive), treating offset[-1] as zero, i.e. string 0 starts at index 0.

Memory usage will be 1 GB (50M * 10 * 2) for the char array, and 200 MB (50M * 4) for the offset array. Very compact even with two-byte chars.
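
A minimal sketch of this layout with a binary-search lookup over it (class and method names are illustrative). The constructor takes a List<String> only to keep the example short; in practice you would append the characters and offsets directly while merging sorted chunks, as described below.

import java.util.List;

public class PackedStringTable {
    private final char[] chars;   // all strings concatenated, in sorted order
    private final int[] offsets;  // offsets[n] = end (exclusive) of string n

    PackedStringTable(List<String> sortedStrings) {
        offsets = new int[sortedStrings.size()];
        int total = 0, n = 0;
        for (String s : sortedStrings) { total += s.length(); offsets[n++] = total; }
        chars = new char[total];
        int pos = 0;
        for (String s : sortedStrings) { s.getChars(0, s.length(), chars, pos); pos += s.length(); }
    }

    // Binary search over the implicit list of strings; returns the string's index, or -1.
    int indexOf(String key) {
        int lo = 0, hi = offsets.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(mid, key);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -1;
    }

    // Compare string n (chars[start..end)) with key, like String.compareTo.
    private int compareAt(int n, String key) {
        int start = n == 0 ? 0 : offsets[n - 1];
        int end = offsets[n];
        for (int i = start, j = 0; i < end && j < key.length(); i++, j++) {
            int cmp = Character.compare(chars[i], key.charAt(j));
            if (cmp != 0) return cmp;
        }
        return Integer.compare(end - start, key.length());
    }
}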

You will have to build this array by merging smaller sorted string arrays in order not to exceed your heap space. But once you have it, it should be reasonably fast.
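
A rough sketch of one merge step, assuming the chunks arrive as sorted String[] arrays and, as suggested in the comments below, the concatenated characters are collected in a StringBuilder. The method name is illustrative; for 50 million entries you would record offsets in a primitive int array rather than a List<Integer>, and repeat or generalize the merge across all chunks.

import java.util.List;

public class ChunkMerger {
    // Merge two sorted chunks into the concatenated-characters + offsets representation.
    // 'sb' accumulates every character; 'offsets' records the end index (exclusive) of each string.
    static void merge(String[] a, String[] b, StringBuilder sb, List<Integer> offsets) {
        int i = 0, j = 0;
        while (i < a.length || j < b.length) {
            String next;
            if (j >= b.length || (i < a.length && a[i].compareTo(b[j]) <= 0))
                next = a[i++];
            else
                next = b[j++];
            sb.append(next);
            offsets.add(sb.length());   // end offset of the string just appended
        }
    }
}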

Alternatively, you could try a memory optimized trie implementation such as https://github.com/rklaehn/radixtree . This uses not just prefix sharing, but also structural sharing for common suffixes, so unless your strings are completely random, it should be quite compact. See the space usage benchmark. But it is Scala, not Java.

Rüdiger Klaehn
  • Can you explain any more about how I would merge smaller string arrays? And would I have to build the offset array at the same time as I merge the string arrays? – 3802400 Sep 27 '16 at 02:18
  • If your compact string arrays are always sorted, merging two is trivial. Just use an algorithm like http://stackoverflow.com/questions/5958169/how-to-merge-two-sorted-arrays-into-a-sorted-array and merge into a builder. You might want to use a big string instead of a char array. Then you can use a StringBuilder. – Rüdiger Klaehn Sep 27 '16 at 07:34
  • StringBuilder.toString() creates a copy of the string. If you are that short of space, you don't want to double the space requirements. – Klitos Kyriacou Sep 27 '16 at 18:45
  • @KlitosKyriacou agree. Using StringBuilder/String is a good way to get something working quickly, not the perfect solution. That said, with a memory usage of only 1G for the concatenated strings, I would not lose sleep over doing a copy. – Rüdiger Klaehn Sep 27 '16 at 18:49