0

how to write billions of data into a trie with less memory

I want to extract some infomation from news like company names,so I write billions of company names into a trie,but it needs much memory and throw out of memory exception,I don't know how to solve it,so anyone can help,thanks in advance.

    public class Node
    {
        public char Value { get; set; }

        public List<Node> Children { get; set; }

        public int Depth { get; set; }

        public string Code { get; set; }

        public bool Terminal { get; set; }

        public Node(char value, int depth)
        {
            Value = value;
            Depth = depth;
            Children = new List<Node>();
        }


        public Node FindChildNode(char c)
        {
            foreach (var child in Children)
                if (child.Value == c)
                    return child;

            return null;
        }


    }

    public class Trie
    {
        private  Node _root;

        public Trie()
        {
            _root = new Node('^',0);
        }

        public Node Prefix(string s)
        {
            var currentNode = _root;
            var result = currentNode;

            foreach (var c in s)
            {
                currentNode = currentNode.FindChildNode(c);
                if (currentNode == null)
                    break;
                result = currentNode;
            }

            return result;
        }



        public void Insert(string randomLength,string code)
        {
            var commonPrefix = Prefix(randomLength);
            var current = commonPrefix;

            for (var i = current.Depth; i < s.Length; i++)
            {
               var newNode = new Node(s[i], current.Depth + 1);
                if (i+1==s.Length)
                {
                    newNode.Terminal = true;
                    newNode.Code = code;
                }
                current.Children.Add(newNode);
                current = newNode;
            }

        }



    }

Trie t=new Trie();
t.Insert("C","ABCG00DFD"); The aboved statement run 1000000000 Loops and the "C" can be replaced with different string with different length,as the loops increasing,it throw out of memory exception,so how to avoid or change it?

Jack
  • 23
  • 1
  • 8
  • Your code doesn't compile - `The name 's' does not exist in the current context`. – Enigmativity Jun 27 '19 at 10:08
  • Are you building and running as 64 bit or 32 bit? If it's 32 bit, there's your problem. – Sean Reid Jun 27 '19 at 10:10
  • 1
    *"how to write billions of data into a trie with less memory"* - erm, what? To insert billions of data you need X * billions of bytes, where X depends on data type. You can find a hash and hashing function to actually present huge amount of data cleverly, but if you want to change single byte/bit - you will still need that whole amount of bytes. As for `OutOfMemoryException` you attempting to allocate too big array somewhere. – Sinatr Jun 27 '19 at 10:15
  • 2
    Why would you even want this many items to be in memory anyway? – DavidG Jun 27 '19 at 10:17
  • Here's a minimal `Trie` definition: `public class Trie : List> { public T Value { get; set; } public bool Terminal { get; set; } }`. – Enigmativity Jun 27 '19 at 10:18
  • Or even better: `public class Trie : Dictionary { }`. – Enigmativity Jun 27 '19 at 10:25
  • 1. sorry, I have changed the code; 2. I build and run the code as 64bit,but the exe is 32bit,so what the problem; 3. Does any memory compression algorith? 4. why using the minimal Trie? – Jack Jun 27 '19 at 10:29
  • @Jack - Apart from the `Prefix` and the `Depth` my minimal trie (in my answer) does everything your does. That's why you use minimal. – Enigmativity Jun 27 '19 at 11:00
  • You can look at the DAWG data structure, I presented it briefly in [this post](https://stackoverflow.com/questions/56627928/reducing-the-size-of-a-trie-of-all-english-words) – m.raynal Jun 27 '19 at 12:07
  • @m.raynal,it is easy to understand the theory of DAWG data structure,as you say,it hard to code – Jack Jun 27 '19 at 12:48
  • @Jack definitely, that's one of its major drawbacks. But with the amount of data you're mentioning, it can be worthy to use it, especially if your entries have a tendency to share similar suffixes. That's where the DAWG really can make a difference. – m.raynal Jun 27 '19 at 13:20

1 Answers1

0

Have a go at this Trie and see if you can get it to work for what you need:

public class Trie : Dictionary<char, Trie>
{
    public void Add(string value)
    {
        var c = String.IsNullOrEmpty(value) ? '\0' : value[0];
        if (!this.ContainsKey(c))
        {
            this[c] = new Trie();
        }
        if (c != '\0')
        {
            this[c].Add(value.Substring(1));
        }
    }
}
Enigmativity
  • 113,464
  • 11
  • 89
  • 172
  • Fisrt of all, thank you .It doesn't help.Actually,dictionary still has relations with arrays,so it costs much memory with the data increasing. – Jack Jun 27 '19 at 12:40
  • @Jack - You're never going to get a trie that is terribly memory efficient. Each `char` uses at least two bytes and each reference to an object uses 4. It's not an memory efficient data structure. But for speed of determining if something is a member it is second-to-none. – Enigmativity Jun 27 '19 at 12:48