Space efficient collection for strings with common prefixes - Java implementation

Question

I need to store millions of string with common prefixes (they don't correspond to file system paths) in a Set like structure in memory, and query the Collection to see if a path exists.

e.g.

/path
/path/1
/path/2
/path/1/a
/path/1/b

I want to store these as efficiently as possible (they will be in memory) , given that there will be many common prefixes for all the strings involved would a Trie be a reasonable candidate?

I'm looking for a recommendation for an implementation of a suitable data structure in Java.

Similar http://stackoverflow.com/questions/2218905/need-memory-efficient-way-to-store-tons-of-strings-was-hat-trie-implementation — Joel, Apr 09 '11 at 12:02

Manuel Salvadores · Answer 1 · 2011-04-08T13:41:37.883

4

A Trie looks like the structure you need. Also similar structures are Radix Tries, that unlike tries, use sequence of characters to label edges. In plain tries edges are labeled with single characters, I am sure they behave better in your case where strings share quite a lot of prefixes.

see also ...

http://code.google.com/p/trie/

http://code.google.com/p/radixtree/

edited Apr 08 '11 at 13:41

answered Apr 08 '11 at 13:35

Manuel Salvadores

16,287
5
37
56

The authors comment doesn't inspire much confidence! "Note to any visitors, this is fine SAMPLE code, but not production code. It was written by an inexperienced programmer (me at the time) in one evening." – Joel Apr 08 '11 at 13:38
yes, you're right ... go for Radix Tries (second link) they are actually better for this case. – Manuel Salvadores Apr 08 '11 at 13:40
Any reason the prefer the second implementation of a radix trie to this one? http://code.google.com/p/patricia-trie/ – Joel Apr 08 '11 at 13:52
But _both_ are radix tries (aka Patricia-Trie), just different implementations.... – Joel Apr 08 '11 at 13:58
no, they are different. the first one is simply 'trie' and the second one is a 'radix trie'. See the wikipedia explanations and you will easily spot the diferences. – Manuel Salvadores Apr 08 '11 at 14:03
Yes, but code.google.com/p/patricia-trie and http://code.google.com/p/radixtree/ _are_ both Radfix Tries (Patricia Tries) – Joel Apr 08 '11 at 14:32
yes, radix trie and patricia trees are the same thing. Sorry. – Manuel Salvadores Apr 08 '11 at 14:37

score 3 · Answer 2 · answered Apr 08 '11 at 13:36

3

This looks like a good candidate implementation: https://github.com/rkapsi/patricia-trie

answered Apr 08 '11 at 13:36

Emil Sit

22,894
7
53
75

Anon · Answer 3 · 2011-04-08T15:47:08.567

Let's consider the tradeoffs before any suggestions.

You say you need to store "millions" of paths. I'll assume one million, because it makes the calculations easier (and even on servers, I haven't seen more than a million directories).

How long are those paths? You've shown an example with very short paths, so we're looking at maybe a hundred megabytes to store those million paths. I don't have a reference for maximum path lengths, but 256 characters sticks in my mind. So your paths will take a maximum of 512 Mb of memory. Do you have that much memory?

How evenly distributed are the pathnames? In other words, do you follow an 80:20 rule where 80% of the paths are found in 20% of the directories? The reason that I ask is because a trie structure needs some form of index between levels. If you have a lot of directories with only a few paths under them, you're going to have a lot of overhead to maintain a trie.

The recommendations: if I had enough memory, I'd use a HashSet<String> and be done with it.

If I didn't have a lot of memory, and a directory structure that followed the 80:20 rule (or, more likely, 95:5), I'd think of a HashMap<String,Set<String>>. The key of this map would be the longest leading path string that had a "reasonable" amount of duplication, and the values would be the remaining strings. You'd probe this map with progressively shorter leading components until you found a match, then probe the set for the remainder.

That leaves open the question of "reasonable" duplication. It's the amount of duplication where the overhead of a two-piece data structure is overcome by the reduction in duplication. For example, /usr/bin/ might be valid (because it holds thousands of files and you save 9 characters or 18 bytes from each), but /usr/local/bin/ probably wouldn't be (at least on my system, it only holds a single file).

score 0 · Answer 4 · answered Apr 08 '11 at 13:36

0

You could use a Tree structure, just like it would on disk. However, you need to remember that tree structures can use as much or more memory in overhead as they save. i.e. they are not really designed to save memory.

Perhaps you could use the cache of the disk subsystem if these files exist. It could be faster.

I would check you really need to do this as you can store a million entries in a JVM quite comfortably. ;)

If you want to minimise memory consumption, you could compress the data in memory. This could be much smaller than any other option, but more complicated to make as efficient.

answered Apr 08 '11 at 13:36

Peter Lawrey

525,659
79
751
1,130

I'll possibly have tens of millions of entries, and reasonably limited resources. – Joel Apr 08 '11 at 13:38
I would consider using compression, you won't get smaller than that. – Peter Lawrey Apr 08 '11 at 13:41
I need to avoid too much processing as I'll be creating & destroying these collections frequently. – Joel Apr 08 '11 at 13:42
Something work knowing is that the latest version of Oracle's JVM can use byte[] instead of char[] for Strings. Can you use the last JVM? – Peter Lawrey Apr 08 '11 at 13:43
Also its important to have the right tools for the job. A server from major vendor with 8 GB of memory costs around £430. ;) – Peter Lawrey Apr 08 '11 at 13:53
So in summary, you can't use the latest JVM which would halve the consumption of Strings, have very restricted hardware, but want to update a large dataset with signifciant performance demands. ;) – Peter Lawrey Apr 08 '11 at 13:55
By the way, by latest version of JVM do you mean the current version 1.6, or beta version 1.7? – Joel Apr 08 '11 at 14:08
You can buy a 1 TB server for £40K (that is 1 TB of main memory ;) – Peter Lawrey Apr 08 '11 at 14:08
Java 6 update 21 or later. -XX:+UseCompressedStrings (Which is on by default) – Peter Lawrey Apr 08 '11 at 14:10
It is hard to imagine anyone needing that much, however 8 - 32 GB servers are available for a surprising low prices even compared to a year or so ago. ;) – Peter Lawrey Apr 08 '11 at 14:56

score 0 · Answer 5 · answered Apr 08 '11 at 13:38

0

What I would use:

a multi-level map that resembles the directory structure.
A balanced tree with single characters as keys and further trees as values.

answered Apr 08 '11 at 13:38

Ingo

36,037
5
53
100

score 0 · Answer 6 · answered Apr 08 '11 at 13:46

0

I would recommend you to store the paths as they are, as Strings. I believe the overhead trying to save memory will result in the reverse.

Of course, it's simple enough to test if it is, by benchmarking against the Tries data structures mentioned above.

answered Apr 08 '11 at 13:46

JenEriC

758
7
10

" I believe the overhead trying to save memory will result in the reverse." why? – Joel Apr 08 '11 at 13:50
The pointers to the children might be larger than the children themselves. – JenEriC Apr 08 '11 at 14:13

Space efficient collection for strings with common prefixes - Java implementation

6 Answers6