When do we actually use a Trie?

Question

I am starting to read about Trie. I got also references from friends here in: Tutorials on Trie

I am not clear on the following:
It seems that to go on and use a Trie one assumes that all the input strings that will be the search space and used to build the Trie are separated in distinct word boundaries.
E.g. all the example tutorials I have seen use input such as:

S={ball, bid, byte, car, cat, mac, map etc...}

Then we build the trie from S and do our searches (really fast)
My question is: How did we end up with S to begin with?
I mean before starting to read about tries I imagined that S would be an arbitrarily long text e.g. A Shakespeare passage.

Then using a Trie we could find things really fast.
But it seems this is not the case.

Is the assumption here that the input passage (of Shakespeare for example) is pre-processed first extracting all the words to get S?

So if one wants to search for patterns (same way as you do when you Google and see all pages having also spaces in your search query) a Trie is not appropriate?
When can we know if a Trie is the data structure that we can actually use?

score 14 · Accepted Answer · answered May 22 '12 at 07:48

14

Tries are useful where you have a fixed dictionary you want to look up quickly. Compared to a hashtable it may require less storage for a large dictionary but may well take longer to look up. One example place I have used it is for mapping URLs to operations on a web server were there may be inheritance of functionality based on the prefix. Here recursing down a trie enables appropriate lookup of all of the methods that need to be called for a particular url. It would also be efficient for storing a dictionary.

For doing text searches you would typically represent documents using a token vector of leximes with weights (perhaps based on occurance frequency), and then search against that to get a ranking of documents against a particular search vector. There a number of standard libraries to do this which I would suggest using rather than writing your own - particularly for removing stopwords, dealing with synonyms and stemming.

answered May 22 '12 at 07:48

JARW

324
2
5

1

Why does it take more space than a `HashTable`? Using a `HashTable` I would have to store `ababa` and `abab` and `aba` and `ab` and `a` as separate tokens strings while with a `Trie` I would just store `ababa`. So why do you say that it takes more space than a `HashTable`? – Jim May 22 '12 at 10:50
1

@Jim I don't think a Trie would take more space than a HashTable. With one exception of a Trie made of words with different first characters which is highly unlikely. e.g. S = { ant, ball, cat }. I have some additional space/time stats on Trie/HashMap data structures here: http://code.google.com/p/java-algorithms-implementation/ – Justin May 22 '12 at 12:59
@Jim: I think you misread: "Compared to a hashtable it [the trie] may require less storage". A trie never takes up more space than a Hashtable in theoretical terms (they have both O(n) space usage in the worst case). However the constant is much larger for a trie, because of the links between nodes, which take up additional space. So in practice a trie may take up more or less space, be faster or slower (traversal also needs time) than a hashtable. This depends a lot on your dataset. – LiKao May 22 '12 at 12:59
@LiKao:`they have both O(n) space usage in the worst case`. I don't understand this. Is the example I gave in my previous note about needing to store `abab` and `aba` and `ab` and `a` in the `HashTable` wrong? Because for a `Trie` we only need to store `abab` – Jim May 22 '12 at 13:26
@Jim: I am confused about your example. If you are storing `abab`, `aba` and `ab` in the `HashTable` you are obviously storing different data than if you only store `abab` in a `trie`. `HashTable`s and `Trie`s both have common use cases as well as different use cases. The typical comparison works only, when you use the two data structures for the same purpose. The purpose I was assuming were flat dictionaries, without substring matching. In that case both have a O(n) worst case. If you compare them for substring matching, this is a different beast altogether. – LiKao May 22 '12 at 13:43
@LiKao:Yes I was thinking about substring matching and not use the trie as a key-value map. – Jim May 22 '12 at 14:15
@Jim: In that case, just storing `abab` in the trie will not be enough. If you store `abab` you will only be able to match prefixes, not any kind of substring. If you want any kind of substring you need to store all suffixes. In the HashTable on the other hand you need to store `abab`, `aba`, `ab` and `a` to be able to do prefix matching and much more to be able to search arbitrary substrings. In that case the space complexity of the trie is much better, even in worst case scenarios. – LiKao May 22 '12 at 14:23

score 3 · Answer 2 · edited May 23 '17 at 12:08

As the other examples have said, a trie is useful because it provides fast string look-ups (or, more generally, look-ups for any sequence). Some examples of where I've used tries:

My answer to this question uses a (slightly modified) trie for matching sentences: it is a trie based on a sequence of words, rather than a sequence of characters. (The other answers to that question probably demonstrate the trie in action more clearly.)
I've also used a trie in a game which had a large number of rooms with names (the total number and the names were defined at run time), each of these names has to be unique and one had to be able to search for a room with a given name quickly. A hash table could also have been used, but in some ways a trie is simpler to implement and faster when using strings. (My trie implementation ended up being ~50 lines of C.)

The trie tag probably has many more examples.

score 2 · Answer 3 · edited May 23 '17 at 12:24

2

We can use tries for sub string searching in linear time, without pre processing the string every time. You can get a best tutorial on suffix tree generation @ Ukkonen's suffix tree algorithm in plain English?

edited May 23 '17 at 12:24

Community

1
1

answered May 22 '12 at 07:39

Abhishek

516
1
6
18

1

For clarification: Linear in the length of the string, not in the size of the dictionary. This is an important distinction, and the sole reason to use a trie. – Emil Vikström May 22 '12 at 07:41

score 2 · Answer 4 · answered May 22 '12 at 12:45

There are multiple ways to use tries. The typical example is a lookup such as the one you have presented. However Tries can also be used to fully index a complete text. Either you use the Ukkonen suffix tree algorithm, to produce a suffix trie, or you explicetly construct the suffix trie by storing suffixes (much slower than Ukkonens algorithm, but also much simpler). As this is preprocessing, which needs to be done only once speed is not that crucial.

For this you would just take your text, insert the full text, then chop of the first letter, insert the resulting text, chop of second letter, insert...

So if we have the text "The Text" we would insert the following set:

{"The Text", "he Text", "e Text", " Text", "Text", "ext", "xt", "t"}

In the resulting suffix trie we can easily search for any kind of prefix. Also this is space efficient, because we do not need to store the whole string, since common prefixes are stored only once.

If you need to store much longer strings space efficiently it is best not only to store prefixes together but also suffixes. In that case you could build up a directed acyclic word graph (DAWG), which is very similar to a trie in conception.

So a trie in that sense allows finding arbitrary substrings, including partial words. If you are only interested in storing words, a different data structure should be used, for example a inverted list (if order is important) or a vector space based retrieval algorithm (in case word order does not matter).

When do we actually use a Trie?

4 Answers4