4

I have a need to do very quick prefix "sql like" searches over a hundreds of thousands of keys. I have tried doing performance tests using a SortedList, a Dictionary, and a SortedDictionary, which I do like so :

var dictionary = new Dictionary<string, object>();
// add a million random strings
var results = dictionary.Where(x=>x.Key.StartsWith(prefix));

I find that that they all take a long time, Dictionary is the fastest, and SortedDictionary the slowest.

Then I tried a Trie implementation from http://www.codeproject.com/Articles/640998/NET-Data-Structures-for-Prefix-String-Search-and-S which is a magnitude faster, ie. milliseconds instead of seconds.

So my question is, is there no .NET collection I can use for the said requirement? I would have assumed that this would be a common requirement.

My basic test :

    class Program
    {
        static readonly Dictionary<string, object> dictionary = new Dictionary<string, object>(); 
        static Trie<object> trie = new Trie<object>(); 

        static void Main(string[] args)
        {
            var random = new Random();
            for (var i = 0; i < 100000; i++)
            {
                var randomstring = RandomString(random, 7);
                dictionary.Add(randomstring, null);
                trie.Add(randomstring, null);
            }

            var lookups = new string[10000];
            for (var i = 0; i < lookups.Length; i++)
            {
                lookups[i] = RandomString(random, 3);
            }

            // compare searching
            var sw = new Stopwatch();
            sw.Start();
            foreach (var lookup in lookups)
            {
                var exists = dictionary.Any(k => k.Key.StartsWith(lookup));
            }
            sw.Stop();
            Console.WriteLine("dictionary.Any(k => k.Key.StartsWith(randomstring)) took : {0} ms", sw.ElapsedMilliseconds);

// test other collections

            sw.Restart();
            foreach (var lookup in lookups)
            {
                var exists = trie.Retrieve(lookup).Any();
            }
            sw.Stop();
            Console.WriteLine("trie.Retrieve(lookup) took : {0} ms", sw.ElapsedMilliseconds);

            Console.ReadKey();
        }

        public static string RandomString(Random random,int length)
        {
            const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

            return new string(Enumerable.Repeat(chars, length)
              .Select(s => s[random.Next(s.Length)]).ToArray());
        }
    }

Results:

dictionary.Any(k => k.Key.StartsWith(randomstring)) took : 80990 ms
trie.Retrieve(lookup) took : 115 ms
sprocket12
  • 5,368
  • 18
  • 64
  • 133

3 Answers3

0

If sorting matters, try to use a SortedList instead of SortedDictionary. They both have the same functionality but they are implemented differently. SortedList is faster when you want to enumerate the elements (and you can access the elements by index), and SortedDictionary is faster if there are a lot of elements and you want to insert a new element in the middle of the collection.

So try this:

var sortedList = new SortedList<string, object>();
// populate list...

sortedList.Keys.Any(k => k.StartsWith(lookup));

If you have a million elements, but you don't want to re-order them once the dictionary is populated, you can combine their advantages: populate a SortedDictionary with the random elements, and then create a new List<KeyValuePair<,>> or SortedList<,> from that.

György Kőszeg
  • 17,093
  • 6
  • 37
  • 65
0

If you can sort the keys once and then use them repeatedly to look up the prefixes, then you can use a binary search to speed things up.

To get the maximum performance, I shall use two arrays, once for keys and one for values, and use the overload of Array.Sort() which sorts a main and an adjunct array.

Then you can use Array.BinarySearch() to search for the nearest key which starts with a given prefix, and return the indices for those that match.

When I try it, it seems to only take around 0.003ms per check if there are one or more matching prefixes.

Here's a runnable console application to demonstrate (remember to do your timings on a RELEASE build):

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Diagnostics;
using System.Linq;

namespace Demo
{
    class Program
    {
        public static void Main()
        {
            int count = 1000000;
            object obj = new object();

            var keys   = new string[count];
            var values = new object[count];

            for (int i = 0; i < count; ++i)
            {
                keys[i] = randomString(5, 16);
                values[i] = obj;
            }

            // Sort key array and value arrays in tandem to keep the relation between keys and values.

            Array.Sort(keys, values);

            // Now you can use StartsWith() to return the indices of strings in keys[]
            // that start with a specific string. The indices can be used to look up the
            // corresponding values in values[].

            Console.WriteLine("Count of ZZ = " + StartsWith(keys, "ZZ").Count());

            // Test a load of times with 1000 random prefixes.

            var prefixes = new string[1000];

            for (int i = 0; i < 1000; ++i)
                prefixes[i] = randomString(1, 8);

            var sw = Stopwatch.StartNew();

            for (int i = 0; i < 1000; ++i)
                for (int j = 0; j < 1000; ++j)
                    StartsWith(keys, prefixes[j]).Any();

            Console.WriteLine("1,000,000 checks took {0} for {1} ms each.", sw.Elapsed, sw.ElapsedMilliseconds/1000000.0);
        }

        public static IEnumerable<int> StartsWith(string[] array, string prefix)
        {
            int index = Array.BinarySearch(array, prefix);

            if (index < 0)
                index = ~index;

            // We might have landed partway through a set of matches, so find the first match.

            if (index < array.Length)
                while ((index > 0) && array[index-1].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                    --index;

            while ((index < array.Length) && array[index].StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
                yield return index++;
        }

        static string randomString(int minLength, int maxLength)
        {
            int length = rng.Next(minLength, maxLength);

            const string CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
            return new string(Enumerable.Repeat(CHARS, length)
              .Select(s => s[rng.Next(s.Length)]).ToArray());
        }

        static readonly Random rng = new Random(12345);
    }
}
Matthew Watson
  • 104,400
  • 10
  • 158
  • 276
0

So, after little test I found something close enought with usage BinarySearch only Cons is that you have to sort keys from a to z. But the biggest the list, the slower it will be so Ternary Search is fastest from all you can actualy found with binary pc architecture.

Method: (Credits shoult go to @Guffa)

    public static int BinarySearchStartsWith(List<string> words, string prefix, int min, int max)
    {
        while (max >= min)
        {
            var mid = (min + max) / 2;
            var comp = string.CompareOrdinal(words[mid].Substring(0, prefix.Length), prefix);
            if (comp >= 0)
            {
                if (comp > 0)
                    max = mid - 1;
                else
                    return mid;
            }
            else
                min = mid + 1;
        }
        return -1;
    }

and test implementation

        var keysToList = dictionary.Keys.OrderBy(q => q).ToList();
        sw = new Stopwatch();
        sw.Start();
        foreach (var lookup in lookups)
        {
            bool exist = BinarySearchStartsWith(keysToList, lookup, 0, keysToList.Count - 1)!= -1
        }
        sw.Stop();

enter image description here

Community
  • 1
  • 1
Sebastian 506563
  • 6,980
  • 3
  • 31
  • 56