2

I built a trie for a dictionary lookup class. It seems to work fine except the trie is quite quite large. Seems to be about 80 MB, and from what I have read it should only be 5 MB big. I am not sure what is making the trie balloon up to 80 MB, once its loaded it runs incredible fast though.

Trie Class

public class Trie {


private TrieNode root = new TrieNode();
public const int ASCIIA = 97;

public TrieNode Insert(string word) {

    char[] charArray = word.ToLower().ToCharArray();
    TrieNode node = root;

    foreach (char character in charArray) {
        node = Insert(character, node);

    }

    node.IsEnd = true;
    return root;
}

private TrieNode Insert(char character, TrieNode node) {
    if (node.Contains(character)) {
        return node.GetChild(character);
    } else {
        int number = System.Convert.ToByte(character) - TrieNode.ASCIIA;
        TrieNode treeNode = new TrieNode();
        node.nodes[number] = treeNode;
        treeNode.Value = number;
        return treeNode;
    }

}

TrieNode Class:

public class TrieNode {

public TrieNode[] nodes;
public bool IsEnd {get; set;}
public int Value {get; set;}
public const int ASCIIA = 97;
public const int ENGL = 26;

public TrieNode() {
    nodes = new TrieNode[ENGL]; 
}

public bool Contains(char character) {
    if (character == 0) 
        return false;

    int number = System.Convert.ToByte(character) - ASCIIA;

    if (number > ENGL)
        return false;

    return (nodes[number] != null);
}


public bool Contains(int character) {

    if (character == 0) 
        return false;

    return (nodes[character] != null);
}

public TrieNode GetChild(char character) {
    int number = System.Convert.ToByte(character) - ASCIIA;
    return nodes[number];
}

public TrieNode GetChild(int character) {
    return nodes[character];
}

And then to Gen the trie, using a dictionary of 170,000 words:

    string[] lines = fileTXT.Split("\n"[0]);
for (int i = 0; i < data.Length;i++) {
        trieDict.Insert(data[i]);
}
nvoigt
  • 75,013
  • 26
  • 93
  • 142
naradlov
  • 107
  • 1
  • 9
  • How are you determining that it's taking up 80MB? If you're looking in TaskManager, it could just be how much the process is using (and hasn't bothered to garbage collect), but not how much the Trie is using. – Gjeltema Jun 11 '13 at 01:02
  • I was using a profiler, but I'll admit I'm not 100% on the workings of those things. Is there a way I can accurately determine how much space it is using? I built multiple Tries and they always seem to take up 80 mb and never GC – naradlov Jun 11 '13 at 01:08
  • 1
    Your Trie is not **OPTIMIZED**...so yes, it will take up an incredible amount of memory. You can reduce memory usage by finding suffixes that are repeated across words and then only keeping one copy that all the prefixes point to. [Here](http://stevehanov.ca/blog/index.php?id=115)'s the basic idea. – Idle_Mind Jun 11 '13 at 05:19

3 Answers3

2
  1. Problem is you are using Child Node array of 26 items. Most of them are empty. On an average, each node will require 26*4 or 26*8 bytes based on 32 bit or 64 bit machines.
  2. You are initializing Child nodes in your constructor, this means, even if your node is leaf node, you are still allocating 26*BYTES which is totally useless. You only allocate array if you need to store children. Leaf nodes in TRIE do not need child array.
  3. To reduce size further, you can simply use bit wise Trie, Which will only need two nodes, however, it increases computation time and reduces performance by very little factor. CPUs use bit wise trie to identify Machine Instructions to execute.
  4. You can use Dictionary instead of array, which will not allocate all 26 letters, as mentioned in this answer How to create a trie in c#. And also you can reduce default capacity.
Community
  • 1
  • 1
Akash Kava
  • 39,066
  • 20
  • 121
  • 167
0

One thing you can do is make TrieNode into a struct and then avoid modifying it after initialization... However you may also want to do a Memory Dump and inspect the memory since it may not be taking up as much space as you think... The memory reported in the Task Manager for the process is not the memory used by your app but the memory reserved for your app by the .NET runtime.

Haney
  • 32,775
  • 8
  • 59
  • 68
0

I faced this exact same problem when creating a trie from a big dictionary. So I constructed a DAWG (Directed Acyclic Word Graph) out of those words, which takes up a very small amount of space (even less than my word dictionary), retaining the same performance as the trie, maybe even faster. It works by identifying common suffixes and prefixes in the words and making an finite automaton out of those. If your dictionary is static, you can create the DAWG and persist it to the disk, and you can load it easily in your application (it is implemented using integer arrays). Here is an implementation.

max
  • 4,248
  • 2
  • 25
  • 38