6

I have a

List<string>

with 1500 strings. I am now using the following code to pull out only string that start with the string prefixText.

foreach(string a in <MYLIST>)
{            
    if(a.StartsWith(prefixText, true, null))
    {
        newlist.Add(a);                   
    }            
}

This is pretty fast, but I'm looking for google fast. Now my question is if I arrange the List in alphabetical order, then compare char by char can I make this faster? Or any other suggestions on making this faster?

Julian
  • 20,008
  • 17
  • 77
  • 108
Scott Selby
  • 9,420
  • 12
  • 57
  • 96

10 Answers10

14

Thus 1500 is not really a huge number binary search on sorted list would be enough probably. Nevertheless most efficient algorithms for prefix search are based on the data structure named Trie or Prefix Tree. See: http://en.wikipedia.org/wiki/Trie

Following picture demonstrates the idea very briefly: enter image description here

For c# implementation see for instance .NET DATA STRUCTURES FOR PREFIX STRING SEARCH AND SUBSTRING (INFIX) SEARCH TO IMPLEMENT AUTO-COMPLETION AND INTELLI-SENSE

George Mamaladze
  • 7,593
  • 2
  • 36
  • 52
4

If you have the list in alpabetical order, you can use a variation of binary search to make it a lot faster.

As a starting point, this will return the index of one of the strings that match the prefix, so then you can look forward and backward in the list to find the rest:

public static int BinarySearchStartsWith(List<string> words, string prefix, int min, int max) {
  while (max >= min) {
    int mid = (min + max) / 2;
    int comp = String.Compare(words[mid].Substring(0, prefix.Length), prefix);
    if (comp < 0) {
      min = mid + 1;
    } else if (comp > 0) {
      max = mid - 1;
    } else {
      return mid;
    }
  }
  return -1;
}

int index = BinarySearchStartsWith(theList, "pre", 0, theList.Count - 1);
if (index == -1) {
  // not found
} else{
  // found
}

Note: If you use a prefix that is longer than any of the strings that are compared, it will break, so you might need to figure out how you want to handle that.

Guffa
  • 687,336
  • 108
  • 737
  • 1,005
  • a short example on how to invoke this would be convenient. For example the min and max parameters might not be superobvious? – vidstige May 06 '12 at 18:29
  • 1
    .Net already has [Array.BinarySearch](http://msdn.microsoft.com/en-us/library/system.array.binarysearch.aspx) – L.B May 06 '12 at 18:32
4

You can use PLINQ (Parallel LINQ) to make the execution faster:

var newList = list.AsParallel().Where(x => x.StartsWith(prefixText)).ToList()
Adi Lester
  • 24,731
  • 12
  • 95
  • 110
4

So many approches were analyzed to achive minimum data capacity and high performance. The first place is: all prefixes are stored in dictionary: key - prefix, values - items appropriate for prefix.

Here simple implementation of this algorithm:

public class Trie<TItem>
{
    #region Constructors

    public Trie(
        IEnumerable<TItem> items,
        Func<TItem, string> keySelector,
        IComparer<TItem> comparer)
    {
        this.KeySelector = keySelector;
        this.Comparer = comparer;
        this.Items = (from item in items
                      from i in Enumerable.Range(1, this.KeySelector(item).Length)
                      let key = this.KeySelector(item).Substring(0, i)
                      group item by key)
                     .ToDictionary( group => group.Key, group => group.ToList());
    }

    #endregion

    #region Properties

    protected Dictionary<string, List<TItem>> Items { get; set; }

    protected Func<TItem, string> KeySelector { get; set; }

    protected IComparer<TItem> Comparer { get; set; }

    #endregion

    #region Methods

    public List<TItem> Retrieve(string prefix)
    {
        return  this.Items.ContainsKey(prefix)
            ? this.Items[prefix]
            : new List<TItem>();
    }

    public void Add(TItem item)
    {
        var keys = (from i in Enumerable.Range(1, this.KeySelector(item).Length)
                    let key = this.KeySelector(item).Substring(0, i)
                    select key).ToList();
        keys.ForEach(key =>
        {
            if (!this.Items.ContainsKey(key))
            {
                this.Items.Add(key, new List<TItem> { item });
            }
            else if (this.Items[key].All(x => this.Comparer.Compare(x, item) != 0))
            {
                this.Items[key].Add(item);
            }
        });
    }

    public void Remove(TItem item)
    {
        this.Items.Keys.ToList().ForEach(key =>
        {
            if (this.Items[key].Any(x => this.Comparer.Compare(x, item) == 0))
            {
                this.Items[key].RemoveAll(x => this.Comparer.Compare(x, item) == 0);
                if (this.Items[key].Count == 0)
                {
                    this.Items.Remove(key);
                }
            }
        });
    }

    #endregion
}
1

I assume that the really fastest way would be to generate a dictionary with all possible prefixes from your 1500 strings, effectively precomputing the results for all possible searches that will return non-empty. Your search would then be simply a dictionary lookup completing in O(1) time. This is a case of trading memory (and initialization time) for speed.

private IDictionary<string, string[]> prefixedStrings;

public void Construct(IEnumerable<string> strings)
{
    this.prefixedStrings =
        (
            from s in strings
            from i in Enumerable.Range(1, s.Length)
            let p = s.Substring(0, i)
            group s by p
        ).ToDictionary(
            g => g.Key,
            g => g.ToArray());
}

public string[] Search(string prefix)
{
    string[] result;
    if (this.prefixedStrings.TryGetValue(prefix, out result))
        return result;

    return new string[0];
}
Douglas
  • 53,759
  • 13
  • 140
  • 188
1

You can accelerate a bit by comparing the first character before invoking StartsWith:

char first = prefixText[0];

foreach(string a in <MYLIST>) 
    {    
         if (a[0]==first)
         {        
            if(a.StartsWith(prefixText, true, null)) 
            { 
                newlist.Add(a);                    
            }
         }             
    } 
1

1500 is usually too few:

  • you could search it in parallel with a simple divide and conquer of the problem. Search each half of the list in two (or divide into three, four, ..., parts) different jobs/threads.

  • Or store the strings in a (not binary) tree instead. Will be O(log n).

  • sorted in alphabetical order you can do a binary search (sort of the same as the previous one)

Mattias Isegran Bergander
  • 11,811
  • 2
  • 41
  • 49
  • Agree. 1500 is to few. Nevertheless there are more efficient algorithms for prefix search then your three options. See my answer below. – George Mamaladze May 06 '12 at 18:38
  • Yes @achitaka-san and Douglas answer is O(1) even. A prefix tree is my second point above, but not nearly as well described as your solution (which I up voted early on). Should've been more clear perhaps. It's not really O(log n) I guess though as I stated, depends on the strings and key so not related to that (see achitaka-san's Wikipedia link or your favorite data structure text book). In a worst case the strings are long and only differ in the end for example etc etc. – Mattias Isegran Bergander May 06 '12 at 18:51
0

Have you tried implementing a Dictionary and comparing the results? Or, if you do put the entries in alphabetical order, try a binary search.

Randy Minder
  • 47,200
  • 49
  • 204
  • 358
  • How to I go about putting List in alphabetical order? Also , it doesn't have to be a List, The strings originally come from DB in Select statement, so they can be in any format – Scott Selby May 06 '12 at 18:16
  • 1
    There are a couple ways to do this. If they came from a DB, you could have the DB do the sorting, which would be the most efficient. Or you could use Ling as follows: MyList = UnsortedList.OrderBy( someField ). – Randy Minder May 06 '12 at 18:19
0

The question to me is whether or not you'll need to do this one time or multiple times.

If you only find the StartsWithPrefix list one time, you can't get faster then leaving the original list as is and doing myList.Where(s => s.StartsWith(prefix)). This looks at every string one time so it's O(n)

If you need to find the StartsWithPrefix list several times, or maybe you're going to want to add or remove strings to the original list and update the StartsWithPrefix list then you should sort the original list and use binary search. But this will be sort time + search time = O(n log n) + 2 * O(log n)

If you did the binary search method, you would find the indexes of the first occurrence of your prefix and the last occurrence via search. Then do mySortedList.Skip(n).Take(m-n) where n is first index and m is last index.

Edit:

Wait a minute, we're using the wrong tool for the job. Use a Trie! If you put all your strings into a Trie instead of the list, all you have to do is walk down the trie with your prefix and grab all the words underneath that node.

jb.
  • 9,921
  • 12
  • 54
  • 90
-2

I would go with using Linq:

 var query = list.Where(w => w.StartsWith("prefixText")).Select(s => s).ToList();
Mitja Bonca
  • 4,268
  • 5
  • 24
  • 30