14

I've implemented a basic search for a research project. I'm trying to make the search more efficient by building a suffix tree. I'm interested in a C# implementation of the Ukkonen algorith. I don't want to waste time rolling my own if such implementation exists.

Goran
  • 6,798
  • 9
  • 41
  • 57
  • Can you elaborate on your question at all? – Matt Hanson Oct 11 '08 at 00:09
  • I am trying to implement a search within a research project. I've implemented the reverse index and incremental population of the index. Next I was looking to make the search even more efficient but did not want to roll my own ST implementation if one exists. – Goran Oct 11 '08 at 07:09

3 Answers3

12

Hard question. Here's the closest to match I could find: http://www.codeproject.com/KB/recipes/ahocorasick.aspx, which is an implementation of the Aho-Corasick string matching algorithm. Now, the algorithm uses a suffix-tree-like structure per: http://en.wikipedia.org/wiki/Aho-Corasick_algorithm

Now, if you want a prefix tree, this article claims to have an implementation for you: http://www.codeproject.com/KB/recipes/prefixtree.aspx

<HUMOR> Now that I did your homework, how about you mow my lawn. (Reference: http://flyingmoose.org/tolksarc/homework.htm) </HUMOR>

Edit: I found a C# suffix tree implementation that was a port of a C++ one posted on a blog: http://code.google.com/p/csharsuffixtree/source/browse/#svn/trunk/suffixtree

Edit: There is a new project at Codeplex that is focused on suffix trees: http://suffixtree.codeplex.com/

torial
  • 13,085
  • 9
  • 62
  • 89
5

Hei, just finished implementing .NET (c#) library containing different trie implementations. Among them:

  • Classical trie
  • Patricia trie
  • Suffix trie
  • A trie using Ukkonen's algorithm

I tried to make source code easy readable. Usage is also very straight forward:

using Gma.DataStructures.StringSearch;

...

var trie = new UkkonenTrie<int>(3);
//var trie = new SuffixTrie<int>(3);

trie.Add("hello", 1);
trie.Add("world", 2);
trie.Add("hell", 3);

var result = trie.Retrieve("hel");

The library is well tested and also published as TrieNet NuGet package.

See github.com/gmamaladze/trienet

George Mamaladze
  • 7,593
  • 2
  • 36
  • 52
3

Here is an implementation of a suffix tree that is reasonably efficient. I haven't studied Ukkonen's implementation, but the running time of this algorithm I believe is quite reasonable, approximately O(N Log N). Note the number of internal nodes in the tree created is equal to the number of letters in the parent string.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using NUnit.Framework;

namespace FunStuff
{
    public class SuffixTree
    {
        public class Node
        {
            public int Index = -1;
            public Dictionary<char, Node> Children = new Dictionary<char, Node>();
        }

        public Node Root = new Node();
        public String Text;

        public void InsertSuffix(string s, int from)
        {             
            var cur = Root;
            for (int i = from; i < s.Length; ++i)
            {
                var c = s[i];
                if (!cur.Children.ContainsKey(c))
                {
                    var n = new Node() {Index = from};
                    cur.Children.Add(c, n);

                    // Very slow assertion. 
                    Debug.Assert(Find(s.Substring(from)).Any());

                    return;
                }
                cur = cur.Children[c];
            }
            Debug.Assert(false, "It should never be possible to arrive at this case");
            throw new Exception("Suffix tree corruption");
        }

        private static IEnumerable<Node> VisitTree(Node n)
        {
            foreach (var n1 in n.Children.Values)
                foreach (var n2 in VisitTree(n1))
                    yield return n2;
            yield return n;
        }

        public IEnumerable<int> Find(string s)
        {
            var n = FindNode(s);
            if (n == null) yield break;
            foreach (var n2 in VisitTree(n))
                yield return n2.Index;
        }

        private Node FindNode(string s)
        {
            var cur = Root;
            for (int i = 0; i < s.Length; ++i)
            {
                var c = s[i];
                if (!cur.Children.ContainsKey(c))
                {
                    // We are at a leaf-node.
                    // What we do here is check to see if the rest of the string is at this location. 
                    for (var j=i; j < s.Length; ++j)
                        if (cur.Index + j >= Text.Length || Text[cur.Index + j] != s[j])
                            return null;
                    return cur;
                }
                cur = cur.Children[c];
            }
            return cur;
        }

        public SuffixTree(string s)
        {
            Text = s;
            for (var i = s.Length - 1; i >= 0; --i)
                InsertSuffix(s, i);
            Debug.Assert(VisitTree(Root).Count() - 1 == s.Length);
        }
    }

    [TestFixture]
    public class TestSuffixTree
    {
        [Test]
        public void TestBasics()
        {
            var s = "banana";
            var t = new SuffixTree(s);
            var results = t.Find("an").ToArray();
            Assert.AreEqual(2, results.Length);
            Assert.AreEqual(1, results[0]);
            Assert.AreEqual(3, results[1]);
        }
    } 
}
cdiggins
  • 17,602
  • 7
  • 105
  • 102
  • -@cdiggins, sorry for my ignorance. It is my first time to see a class within another class. In your code,`public class Node` is within `public class SuffixTree`, what is the skill herein? –  Jul 01 '14 at 15:56
  • 1
    I know this post is old but there seems to be an issue with this code under some circuimstances that I cant seem to figure. Set S = "TAGGAAATTC" and try t.Find("CTATT") and the line f (Text[cur.Index + j] != s[j]) will result in IndexOutsideBoundsOfArray exception. Your implementation seems to be very elegant and efficient enough (typing out Ukkoken is a bit.. difficult). I can't seem to find a way around this bug. – Ilhan Aug 15 '18 at 21:03
  • 1
    @Ilhan I just made a change that should fix the bug. – cdiggins Aug 28 '18 at 16:15
  • Search doesn't work properly in this implementation for cases, when there are 2 matches. – Anton Jun 19 '20 at 11:22