If you expect low prefix sharing, then a trie may not be your best option.
Since you only load the lookup table once, at startup, and your goal is low memory footprint with "reasonable speed" for lookup, your best option is likely a sorted array and binary search for lookup.
First, you load the data into an array. Since you likely don't know the size up front, you load into an ArrayList
. You then extract the final array from the list.
Assuming you load 50 million 10 character strings, memory will be:
10 character string:
String: 12 byte header + 4 byte 'hash' + 4 byte 'value' ref = 24 bytes (aligned)
char[]: 12 byte header + 4 byte 'length' + 10 * 2 byte 'char' = 40 bytes (aligned)
Total: 24 + 40 = 64 bytes
Array of 50 million 10 character strings:
String[]: 12 byte header + 4 byte 'length' + 50,000,000 * 4 byte 'String' ref = 200,000,016 bytes
Values: 50,000,000 * 64 bytes = 3,200,000,000 bytes
Total: 200,000,016 + 3,200,000,000 = 3,400,000,016 bytes = 3.2 GB
You will need another copy of the String[]
when you convert the ArrayList<String>
to String[]
. The Arrays.sort()
operation may need 50% array size (~100,000,000 bytes) for temporary storage, but if ArrayList
is released for GC before you sort, that space can be reused.
So, total requirement is ~3.5 GB, just for the symbol table.
Now, if space is truly at a premium, you can squeeze that. As you can see, the String
itself adds an overhead of 24 bytes, out of the 64 bytes. You can make the symbol table use char[]
directly.
Also, if your strings are all US-ASCII
or ISO-8859-1
, you can convert the char[]
to a byte[]
, saving half the bytes.
Combined, that reduces the value size from 64 bytes to 32 bytes, and the total symbol table size from 3.2 GB to 1.8 GB, or roughly 2 GB during loading.
UPDATE
Assuming input list of strings are already sorted, below is example of how you do this. As an MCVE, it just uses a small static array as input, but you can easily read them from a file instead.
public class Test {
public static void main(String[] args) {
String[] wordsFromFile = { "appear", "attack", "cellar", "copper",
"erratic", "grotesque", "guitar", "guttural",
"kittens", "mean", "suit", "trick" };
List<byte[]> wordList = new ArrayList<>();
for (String word : wordsFromFile) // Simulating read from file
wordList.add(word.getBytes(StandardCharsets.US_ASCII));
byte[][] symbolTable = wordList.toArray(new byte[wordList.size()][]);
test(symbolTable, "abc");
test(symbolTable, "attack");
test(symbolTable, "car");
test(symbolTable, "kittens");
test(symbolTable, "xyz");
}
private static void test(byte[][] symbolTable, String word) {
int idx = Arrays.binarySearch(symbolTable,
word.getBytes(StandardCharsets.US_ASCII),
Test::compare);
if (idx < 0)
System.out.println("Not found: " + word);
else
System.out.println("Found : " + word);
}
private static int compare(byte[] w1, byte[] w2) {
for (int i = 0, cmp; i < w1.length && i < w2.length; i++)
if ((cmp = Byte.compare(w1[i], w2[i])) != 0)
return cmp;
return Integer.compare(w1.length, w2.length);
}
}
Output
Not found: abc
Found : attack
Not found: car
Found : kittens
Not found: xyz