3

I had some questions about usage of Tries/SortedSets for a dictionary.

  1. Which is more efficient for lookups?
  2. Which is more efficient for virtual memory?
  3. Are there any other advantages/disadvantages of either structure when utilized for a dictionary?

No need to answer all three, just looking for some good responses and source material if you have any. Thanks.

PandaBearSoup
  • 699
  • 3
  • 9
  • 20
  • 1
    Maybe [this SO article *How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?*](http://stackoverflow.com/questions/245878/how-do-i-choose-between-a-hash-table-and-a-trie-prefix-tree) could help? – keenthinker Jul 22 '13 at 21:11

1 Answers1

0
  1. Lookups in a Trie are blazing fast, as they just require O(length of key) comparisons, and are almost as fast as it's possible to be. A SortedSet is generally implemented using balanced binary search trees, which would perform many more comparisons, in the worst case O(height of tree) string comparisons. So the Trie is the clear winner here.

  2. Virtual memory efficiency can be seen as how fast the data structure can be loaded into memory. SortedSet takes up space proportional to the number of elements. It's implemented using pointers, which can be bad for the loading efficiency. That can be improved by serializing it and storing it in an array, but that increases the space needed. A Trie in its most simple form takes a lot of memory. It's also implemented using pointers, which is again bad for loading efficiency. Even if serialized, it takes a large amount of memory. But there are interesting alternatives here, which compress the trie and give the same performance. Radix Tries take significantly less amount of memory. Even better, a DAWG (directed acyclid word graph) overlaps common suffixes and prefixes and compresses the dictionary by a huge amount. After compression, the DAWG could take less space than your dictionary itself. It is implemented using an array, so it's fast to load too. At the end, if you have a static dictionary, DAWG would be the best way to go, otherwise it depends.

  3. A trie sees keys as sequences. It is a prefix tree. You can get all words starting from a prefix very fast. Using a trie, you can perform auto completion and auto correction efficiently. Some keys like floating point numbers, could lead to long chains in the trie, which is bad. A SortedSet sees keys as comparable items. So it is easily possible to partition the elements. Both SortedSet and Trie can offer the keys in alphabetical order, but I guess SortedSet would be much faster.

max
  • 4,248
  • 2
  • 25
  • 38
  • One callout: On number 1, "So the Trie is the clear winner here." From what I've found, lookup efficiency on Sorted Sets is O(log(n)). So, for a search term like "dinosaur" (8 characters), you dictionary would have to have > 100 million (10^8) words for a trie to be more efficient. – emilyk Feb 22 '17 at 01:43