16

I have a list of approx. 500,000 strings, each approx. 100 characters long. Given a search term, I want to identify all strings in the list that contain the search term. At the moment I am doing this with a plain old dataset using the Select method ("MATCH %term%"). This takes about 600ms on my laptop. I'd like to make it faster, maybe 100-200ms.

What would be a recommended approach?

Performance is critical so I can trade memory footprint for better performance if necessary (within reason). The list of strings will not change once initialised so calculating hashes would also be an option.

Does anyone have a recommendation and which C# data structures are best suited to the task?

njr101
  • 9,499
  • 7
  • 39
  • 56
  • 4
    This task seems quite suited for parallel execution. How many CPU cores you got? That plus Boyer Moore should give good enough improvement. – leppie Sep 29 '11 at 16:20
  • 5
    Have you tried loading into a database and let the database engine do what its good at doing? – Mongus Pong Sep 29 '11 at 16:25
  • Are you looking for words or arbitrary substrings? – Gabe Sep 29 '11 at 16:27
  • MongusPong: I don't have a database and I don't think this would be any quicker. A DB could not take advantage of indexing because of the "LIKE %...%" clause. I'm looking for a tailor-made optimization. Gabe: Arbitrary substrings. If I were looking for words I could tokenize the strings and build my own index. – njr101 Sep 30 '11 at 08:31
  • @leppie Why BM? There are multiple string search algorithms (Aho-Corasick, Wu-Manber) or tries. – Konrad Rudolph Sep 30 '11 at 14:21
  • @KonradRudolph: It would be simple, and only a one time setup (or any of the other classic string search algorithms). I dont know the others you mention, but tries would certainly be more inefficient ito complexity and that it is not really suited for a single string search. – leppie Sep 30 '11 at 15:25
  • Protip (IIRC, .NET 1.1): A `Regex` will attempt a Boyer Moore search for simple patterns, eg "foobar". – leppie Sep 30 '11 at 15:27
  • @leppie Sorry, I mistakenly assumed that the OP needed a multi-pattern search. – Konrad Rudolph Sep 30 '11 at 21:19
  • Can you share your final option? @nrj101 ? – Daniel Sep 27 '18 at 08:26

6 Answers6

19

I've heard good things about Lucene.NET when it comes to performing quick full-text searches. They've done the work to figure out the fastest data structures and such to use. I'd suggest giving that a shot.

Otherwise, you might just try something like this:

var matches = list.AsParallel().Where(s => s.Contains(searchTerm)).ToList();

But it probably won't get you down to 100ms.

StriplingWarrior
  • 151,543
  • 27
  • 246
  • 315
  • Thanks for suggestion. Lucene seems like quite a big overhead (in terms of framework) for just this one situation but I could look at it. Do you know if it handles *arbitrary* substrings? S free-text search will not help if it tokenizes the strings into words. – njr101 Sep 30 '11 at 08:33
  • @njreed.myopenid.com: I updated my answer based on my comments on another question. Given the requirements, though, you'll likely need an algorithm change to get the performance boost you're looking for. – StriplingWarrior Sep 30 '11 at 14:16
  • 2
    Thanks @StriplingWarrior. The search is now down to about 50ms. I don't need to implement any special algorithm. The combination of using Linq with a list of strings was enough. Guess the dataset was brining more overhead with it than I thought. – njr101 Oct 13 '11 at 14:42
4

Have you tried the following?

list.FindAll(x => x.Contains("YourTerm")).ToList();

For some reason the List.AsParallel().Where(...) is slower than list.FindAll(...) on my PC.

list.AsParallel().Where(x => x.Contains("YourTerm")).ToList();

Hope this will help you.

Fever
  • 251
  • 1
  • 5
  • 12
2

A trie or suffix tree would help in making this faster - this is essentially what fulltext search (usually) is using.

There are implementations in C# you can use, also see this SO thread: Looking for the suffix tree implementation in C#?

Also as mentioned by @leppie parallel execution will likely be already provide you with the x3 performance gain you are looking for. But then again you will have to measure closely, without that it's anyone's guess.

Community
  • 1
  • 1
BrokenGlass
  • 158,293
  • 28
  • 286
  • 335
  • Thanks for suggestion. I'll see how much boost parallel execution gives and then see if I need a change to the implementation. – njr101 Sep 30 '11 at 08:40
1

Have you tried loading your strings into a List<string> and then using the Linq extensions Contains method?

var myList = new List<string>();
//Code to load your list goes here...

var searchTerm = "find this";
var match = myList.Contains(searchTerm);
Paige Cook
  • 22,415
  • 3
  • 57
  • 68
  • 2
    I think that will only match if one of the strings in the list is exactly what's being searched for. The person asking the question wants to look inside all of those strings as well. – RandomEngy Sep 29 '11 at 16:29
  • 1
    Yup, to fit the OP, you'd have to do `myList.Where(s => s.Contains(searchTerm))`. You could throw an `AsParallel` in there to make the query run in parallel, and you might tweak the parameters to `string.Contains` if you know that you want an exact match instead of a culture-aware one. Even so, you wouldn't get an order of magnitude of difference in speed. – StriplingWarrior Sep 29 '11 at 16:33
  • Good suggestion Stripling. The parallel processing might just give me the boost I need. I need to try this. – njr101 Sep 30 '11 at 08:35
  • Thanks @Paige-Cook, using a list of strings was a great suggestion; it's much faster. The search is now down to about 50ms. I can only accept one answer so I've accepted Stripling because it included the use of Linq to query the list as required in the question. – njr101 Oct 13 '11 at 14:40
1
public static bool ContainsFast<T>(this IList<T> list, T item)
{
    return list.IndexOf(item) >= 0;
}

Base on tests that I did, this variation of Contains was about 33% faster on my side.

SuperOli
  • 1,784
  • 1
  • 11
  • 23
0

According to these benchmarks, the fastest way to check if a string occurs in a string is the following:

for (int x = 0; x < ss.Length; x++)
    for (int y = 0; y < sf.Length; y++
        c[y] += ((ss[x].Length - ss[x].Replace(sf[y], String.Empty).Length) / sf[y].Length > 0 ? 1 : 0);

Thus, you could:

  1. Loop through the list using a Parallel.For construct
  2. Implement the code above to check if a string contains what you're looking for. "SS" is the string[] of strings to search; "SF" is the string[] of strings to search for; c[y] is the total count of each one found.

Obviously you'd have to adapt them to your List[string] (or whatever data structure you're using).