10

One of my favourite data structures in college was the Trie. It's a great data structure for holding a large set of strings if the prefixes are shared. Lookups are also nice, since they are done at O(|length|) of the string regardless of how many strings are in the set.

By comparison, a balanced tree would be O(log N) in the number of set items, plus whatever you pay for comparisons. A hash table would involve the hash calculation, comparison, etc.

It is therefore surprising to me that there is no Trie implementation in the standard library of most languages.

The only reason I could come up with was the possibility that memory access costs make it too expensive. Rather than investigating O(log N) locations if doing a tree lookup, you are not doing O(|length|) different locations, with all the consequences. If the strings are long, this could end up being way too much.

So I'm wondering: how much is what I just described a concern? What do you do when you need to store a large set or map of strings?

Community
  • 1
  • 1
Uri
  • 88,451
  • 51
  • 221
  • 321
  • " in the standard library of most functions." did you mean " in the standard library of most _languages_?" – Assaf Lavie Mar 09 '09 at 00:09
  • If you're going to post closely related questions (http://stackoverflow.com/questions/623892/where-do-i-find-a-standard-trie-based-map-implementation-in-java) you should link them together. – Matthew Flaschen Mar 09 '09 at 00:21
  • I was going to, and then put the Wikipedia link for Trie instead... Anyway, now you put the link, so we're good. – Uri Mar 09 '09 at 00:48

5 Answers5

7

I hadn't thought of this as an area of concern before, but now that you mention it, there are times when a standard Trie implementation might be handy. On the other hand, as far as I know, Tries are used by Python and Perl and other string-savvy languages that I use now.

Last I checked, which was ages ago, the BSD kernel code used Tries (Patricia Tries) in the code to select the best interface for sending packets. Looks like Wikipedia has some info.

Craig S
  • 952
  • 4
  • 7
4

You could just build two sample apps and see which one performs better. Memory access is cheap assuming you don't page fault. Then it's very expensive. For client application development, its almost always better to process than to access memory for this very reason. Modern processors are ridiculously fast, but cache misses still hurt.

i_am_jorf
  • 53,608
  • 15
  • 131
  • 222
  • I used to work at Intel for several years, so I'm extremely paranoid even about falling out of the same cache line. Besides, if each node is located elsewhere on the heap, unless my garbage collector is reorganizing stuff I may very well page fault. – Uri Mar 09 '09 at 00:15
  • 1
    Good luck writing any search algorithm that stays in the same cache line! – Andrew Grant Mar 09 '09 at 00:19
  • I've actually once did that with fixed graphs for fun (think subway maps) and got an impressive speedup... :) – Uri Mar 09 '09 at 05:15
2

I did some performance testing in C# with a Trie and a Dictionary (a strongly typed hash table). I found that the Dictionary was 5-10 times faster than the Trie. Perhaps my implementation of the Trie could be optimized a bit, but hardly enough to be much faster than (or perhaps even as fast as) the Dictionary.

The ContainsKey method in the dictionary is close to an O(1) operation (depending on how good the hashing algorithm is), so it's not easy to make a collection that beats that as long as the hashing algorithm is reasonably fast.

With a custom IEqualityComparer you can use most anything as a key in a Dictionary, which makes it rather flexible. A Trie is a bit more limited in what you can use as key, so that limits the usefulness somewhat.

Guffa
  • 687,336
  • 108
  • 737
  • 1,005
  • 3
    Of course the hash has faster access. The advantage of a trie over a hash is memory efficiency. – Frank Aug 21 '09 at 02:06
  • @Frank Dictionary: store each string with little overhead. Trie: store common letters once but with an overhead of allocating an object with at least two pointers (left neighboor, leftmost child). Conclusion: In C# at least, dictionary is much much more efficient in terms of space. I speak from a sad sad experience. – Ali Ferhat Jan 20 '12 at 22:14
0

Tries are particularly useful for ip address and subnet lookup, and always will be on any architecture. You can find a trie implementation for addresses in The open-source IPAddress Java library. Disclaimer: I am the project manager of that library.

Tries organize by address prefixes, as in this example:

● 0.0.0.0/0 (10)
└─○ 0.0.0.0/1 (9)
  ├─○ 8.0.0.0/6 (8)
  │ ├─● 8.9.8.0/24 (7)
  │ │ └─○ 8.9.8.0/28 (6)
  │ │   ├─● 8.9.8.0/29 (1)
  │ │   └─● 8.9.8.8/29 (5)
  │ │     ├─○ 8.9.8.8/30 (2)
  │ │     │ ├─● 8.9.8.9 (1)
  │ │     │ └─● 8.9.8.10 (1)
  │ │     └─● 8.9.8.12/30 (2)
  │ │       └─● 8.9.8.12/31 (1)
  │ └─● 10.0.2.15 (1)
  └─● 127.0.0.1 (1)

The structure allows for efficient lookup, routing and containment checks.

Sean F
  • 4,344
  • 16
  • 30
0

It's hard to make a standard Trie class broadly useful enough to justify putting one in a standard library.

Of course it would require anything you put in it to implement a common string-like interface.

And since the Trie doesn't actually store the strings, iterating through it is either slow, because it has to reconstitute all the elements, or awkward, because you don't actually get the strings.

And, you know, average_length and max_length are both at least O(log N). If you know the elements are strings, then you can search in a BST in O(length + log N), which is pretty much the same as O(length), and you retain the advantage of actually having the strings in the structure.

Really the only thing that Tries win for is memory-efficient storage of strings with long common prefixes. It happens, but it's not that common. I guess none of the languages have deemed it to be worth the trouble include one.

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87