
Hello, I am trying to create a very fast algorithm to detect keywords (or lists of keywords) in a collection.

Before anything, I have read a lot of Stack Overflow (and other) posts without being able to improve the performance to the level I expect.

My current solution is able to analyze an input of 200 chars against a collection of 400 lists in 0.1825 ms (5 inputs analyzed per ms), but this is way too slow, and I am hoping to improve this performance by at least 5 times (which is the requirement I have).

Solutions tested:

  • Manual search
  • A highly complex regex (groups, backreferences...)
  • A simple regex called multiple times (to match each of the keywords)
  • A simple regex to match the input's keywords, followed by an intersect with the tracked keywords (current solution)
  • Multi-threading (huge negative impact on performance (~100x slower), so I am not sure that would be the best solution for this problem)

Current solution:

input (string): the string to parse and analyze to verify which of the keyword lists it contains. Example: "hello world! How are you Mr #piloupe?".

tracks (string[]): array of strings that we want to match (a space means AND). Example: "hello world" matches any string that contains both 'hello' and 'world', whatever their location.

keywordList (string[][]): each track split into its list of keywords. Example: { { "hello" }, { "#piloupe" }, { "hello", "world" } }

uniqueKeywords (string[]): array of all the distinct keywords of the keywordList. With the previous keywordList that would be: { "hello", "#piloupe", "world" }

None of this information requires any performance improvement, as it is all constructed only once, before any input is analyzed.
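
For concreteness, here is a minimal sketch of how these structures can be built once up front (only the names are from the definitions above; the construction itself is illustrative):

using System.Linq;

// Built once, before any input is analyzed.
string[] tracks = { "hello", "#piloupe", "hello world" };

// Each track split into its keywords (a space means AND):
// { { "hello" }, { "#piloupe" }, { "hello", "world" } }
string[][] keywordList = tracks.Select(t => t.Split(' ')).ToArray();

// All distinct keywords across every track:
// { "hello", "#piloupe", "world" }
string[] uniqueKeywords = keywordList.SelectMany(k => k).Distinct().ToArray();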

The algorithm that finds the matching tracks:

// Store in the class performing the queries
readonly Regex _regexToGetAllInputWords = new Regex(@"\#\w+|\w+", RegexOptions.Compiled);

List<string> GetInputMatches(string input)
{
    // Extract all the words from the input
    var inputWords = _regexToGetAllInputWords.Matches(input.ToLower()).OfType<Match>().Select(x => x.Value).ToArray();

    // Keep only the input words that are tracked keywords
    var matchingKeywords = uniqueKeywords.Intersect(inputWords).ToArray();

    List<string> result = new List<string>();

    // For each track, check whether all of its keywords matched
    for (int i = 0; i < keywordList.Length; ++i)
    {
        bool trackIsMatching = true;

        // For all the keywords of the track, check whether they exist
        // (note: Contains is a linear scan over matchingKeywords)
        for (int j = 0; j < keywordList[i].Length && trackIsMatching; ++j)
        {
            trackIsMatching = matchingKeywords.Contains(keywordList[i][j]);
        }

        if (trackIsMatching)
        {
            result.Add(tracks[i]);
        }
    }

    return result;
}

Any help will be greatly appreciated.

user2465083
  • How are you testing performance? – japreiss Sep 16 '13 at 14:24
  • Just glancing at your code, it seems that using Parallel LINQ could speed things up on a multi core system. – kwcto Sep 16 '13 at 14:33
  • Have you broken down the timings on this at all? I'd be very interested to know whether before or after `List<string> result = new List<string>();` was taking longer. – penguat Sep 16 '13 at 14:37
  • http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm – Kevin DeVoe Sep 16 '13 at 14:39
  • How am I testing performance: I start the Stopwatch before calling the method and stop it after the call. – user2465083 Sep 16 '13 at 14:58
  • I tried Parallel LINQ, and the number of operations (between 400 and 2000) is too small to improve the performance. – user2465083 Sep 16 '13 at 14:58
  • Have you broken down the timings on this at all? I'd be very interested to know whether before or after List<string> result = new List<string>(); was taking longer. -> I don't understand your point, sorry :P – user2465083 Sep 16 '13 at 14:59
  • PS: I am no expert on Parallel LINQ, so you might have an idea on how to improve it with PLINQ, but I cannot find a way to make the threading call less costly than doing the operations in the main thread – user2465083 Sep 16 '13 at 15:05
  • Eric Lippert has [good advice](http://ericlippert.com/2012/12/17/performance-rant/) on answering “which is faster?” questions. – Dour High Arch Sep 16 '13 at 16:19
  • Are you doing your timings in Release mode, without the debugger attached? Are you running thousands (or millions) of iterations and averaging the results? Are you taking JIT time into account? See Eric Lippert's series on Benchmarking Mistakes. [Part 1](http://tech.pro/blog/1293/c-performance-benchmark-mistakes-part-one). Links to the other parts are in the sidebar. – Jim Mischel Sep 16 '13 at 16:26
  • If you are searching many input strings for the same keywords, then something like the [Aho-Corasick string matching](http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm) algorithm is the way to go. Especially if the input strings are very long. – Jim Mischel Sep 16 '13 at 16:28
  • @Jim Mischel I am performing 20 iterations for my tests and taking an average. I am not taking JIT into account; I don't even know what should be taken into account with JIT. I will read the posts and keep you up to date. I am not using very long input strings (between 100 and 200 chars) but I'll have a look at the algorithm – user2465083 Sep 16 '13 at 16:52
  • The Aho-Corasick algorithm cannot be used in my case, as I am looking to match the specific keywords and not 'sub-keywords' – user2465083 Sep 16 '13 at 17:26

3 Answers


The short answer is to parse every word and store it in a binary-tree-like collection; SortedList or SortedDictionary would be your friend here.

With very little code, you can add your words to a sorted collection and do a binary search on it (for example List&lt;T&gt;.BinarySearch on a sorted List, or SortedList&lt;TKey,TValue&gt;.IndexOfKey, which is also a binary search). That is an O(log n) lookup, so you should be able to search through thousands or millions of words in a handful of comparisons. The performance cost moves to the inserts (the collection keeps itself sorted while inserting), but that is paid only once and is necessary for the binary search.
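
A minimal sketch of that idea, using a sorted List&lt;string&gt; plus List&lt;T&gt;.BinarySearch (the variable names are illustrative; any sorted collection with a binary search works the same way):

using System;
using System.Collections.Generic;

// Built once: sort the unique keywords (O(n log n), paid up front).
List<string> sortedKeywords = new List<string>(uniqueKeywords);
sortedKeywords.Sort(StringComparer.Ordinal);

// Per input word: O(log n) comparisons. The same comparer must be
// used for both the sort and the search.
bool isTracked = sortedKeywords.BinarySearch("hello", StringComparer.Ordinal) >= 0;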

I wouldn't bother with threading since you need results in less than 1ms.

The long answer is to look at something like Lucene, which can be especially helpful if you're doing an autocomplete-style search. RavenDB uses Lucene under the covers and can do background indexing for you; it can search through millions of records in a few milliseconds.

Tim P.

I would like to suggest using a hash table. With hashing, you can convert a string to an integer representing the index of that string in the hash table. This is much faster than a sequential search.
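
A minimal sketch of this, assuming the uniqueKeywords array from the question; HashSet&lt;string&gt; hashes each string and gives amortized O(1) membership tests:

using System.Collections.Generic;

// Built once from the question's uniqueKeywords array.
HashSet<string> keywordSet = new HashSet<string>(uniqueKeywords);

// Per-word lookup is O(1) on average, replacing the O(n)
// Intersect/Contains scans over the original arrays.
bool isTracked = keywordSet.Contains("hello");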

Maher
  • Hi there :) Would you please be more specific, as I have already tried a solution with HashSet. – user2465083 Sep 16 '13 at 16:47
  • Hi bud :) This topic has been discussed previously; please take a look here: http://stackoverflow.com/questions/114085/fast-string-hashing-algorithm-with-low-collision-rates-with-32-bit-integer Tell me if that helps :) – Maher Sep 16 '13 at 16:56

The ultimate solution is the Elastic Binary Tree (ebtree) data structure. It is used in HAProxy to match rules against URLs in proxied HTTP requests (and for many other purposes as well).

An ebtree is built from your 'keyword' patterns and allows faster matching than either a SortedList or hashing. It can beat hashing because hashing reads the input string once (or at least several of its characters) to generate the hash code, and then again to evaluate .Equals(), so hashing reads all characters of the input one or more times. An ebtree reads each character at most once: it either finds the match, or, if there is no match, reports that after O(log n) characters, where n is the number of patterns.

I'm not aware of an existing C# implementation of ebtree, but surely many would be pleased if someone were to write one.
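
For illustration only, here is a rough trie sketch (not an ebtree) that shows the "read each character at most once" property described above:

using System.Collections.Generic;

// Illustrative trie, not an ebtree: a lookup walks the query string
// character by character and stops at the first mismatch.
class KeywordTrie
{
    class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsKeyword;
    }

    readonly Node _root = new Node();

    public void Add(string keyword)
    {
        Node node = _root;
        foreach (char c in keyword)
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsKeyword = true;
    }

    public bool Contains(string word)
    {
        Node node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false; // mismatch: stop without reading further characters
        }
        return node.IsKeyword;
    }
}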

robert4