1

I'm attempting to construct a suffix trie, and due to strict requirements it must be indexed in memory.

EDIT: The problem is not the tree itself, but actually the way I was reading the file.

Palindrome
  • 25
  • 6
  • My guess is that the file is too big and you don't have enough memory. Increase the max heap size and run it again. – duffymo Mar 12 '12 at 18:01
  • Please see http://stackoverflow.com/questions/9649722/how-to-refer-to-children-in-a-tree-with-millions-of-nodes and http://stackoverflow.com/questions/9670080/memory-exception-in-c-sharp – Yahia Mar 12 '12 at 18:02
  • 1
    How large is large? And how do you index the file? – H H Mar 12 '12 at 18:09
  • 1
    Either you're doing the same homework or this is [a repost](http://stackoverflow.com/questions/9670080/) under a different account. Much better question, but you should have edited the old one. – H H Mar 12 '12 at 18:12

1 Answers1

2

If you're passing the entire text file as a single string you could easily run into an out of memory exception with your first loop!

// imagine if s.Length was 100k or so
for (int i = 0; i < s.Length; i++)
{
    AddString(s.Substring(i, s.Length-i));
}

When reading the file to construct the trie, you'll need to split each line and probably normalize the characters:

string line;
while (null != (line = reader.ReadLine()))
{
    string[] parts = line.Split(' ', ',', '.', '!', '\t', '?'); // naive
    foreach (string part in parts)
    {
        if (part.Length > 0)
        {
            // make each string uppercase so as to avoid Hello and hello being
            // two trie entries
            trie.AddSuffix(part.ToUpperInvariant());
        }
    }
}

For example (on the output from dir /b c:\windows):

A
 D
  D
   I
    N
     S
  E
   D
 P
  P
   C
    O
     M
      P
       A
        T
   P
    A
     T
      C
       H
...

To appropriately handle larger files, a more compact trie structure would be desirable. I would just have unshared suffixes stored in a separate dictionary:

// If you add a character, but there is no entry in m_children
// just park the tail end of it here
Dictionary<char, string> m_tails;

You would then move the per character logic to your AddString of the SuffixNode:

public void AddString(string s)
{
    if (s.Length == 0) return;

    char c = s[0];
    if (m_children.ContainsKey(c))
    {
        if (s.Length > 1) m_children[c].AddString(s.Substring(1));
    }
    else if (m_tails.ContainsKey(c))
    {
        SuffixNode node = new SuffixNode();
        node.AddString(m_tails[c]);
        if (s.Length > 1) node.AddString(s.Substring(1));

        m_children.Add(c, node);
        m_tails.Remove(c);
    }
    else
    {
        m_tails.Add(c, s.Length > 1 ? s.Substring(1) : "");
    }
}

Now you have a much more compact version of the trie, which will greatly decrease the number of child SuffixNodes created for any given corpus. Returning to the dir /b c:\windows example, we can see a practical reduction in nodes:

A
 P
  P
   COMPAT
   PATCH
  I
 T
  I
   O
    N
     S
...

At this point your trie has a more efficient representation. You're left with determining how to deal with terminal node representations in order to ensure lookups are accurate.

user7116
  • 63,008
  • 17
  • 141
  • 172
  • Thanks for the suggestion - I tried implementing what you've suggested, however only a small bit of the text file is actually read. Any idea why? – Palindrome Mar 12 '12 at 18:09
  • I get all of the file being available in the trie and am able to read in rather large files using your code. I've yet to establish the limit however. – user7116 Mar 12 '12 at 18:16
  • One thing you should note is that your current usage of a `Dictionary` is efficient in terms of lines of code, but likely inefficient in terms of memory representation once you get into a large corpus. – user7116 Mar 12 '12 at 18:21
  • I can't understand what's going on - I'm using the code you suggested for reading the file, yet it only reads the last few sentences of the file! – Palindrome Mar 12 '12 at 18:36
  • I used your code almost verbatim (instead I use `AddString(s.Substring(i))`). Did you declare `trie` inside the while loop? – user7116 Mar 12 '12 at 18:44
  • No, trie is not declared there. It's quite strange because it **should** be working. Thanks for your answer though, this looks like it's the problem – Palindrome Mar 12 '12 at 18:51
  • Using the more compact representation I was able to process a 12MB text file (`dir /b/s c:\windows`) in a working set of ~2.7GB. Hardly efficient, but it ran to completion. – user7116 Mar 12 '12 at 19:17