I have the following code that works, but would like to speed it up using LINQ (or something else) to find if any of the Regex search strings are in the target.

List<Regex> Filters = new List<Regex>();
Filters.Add(new Regex("string1", RegexOptions.IgnoreCase));
Filters.Add(new Regex("string2", RegexOptions.Compiled));
...
bool found = false;
string target = "string which may contain string1 or string2 or neither";
foreach (Regex r in Filters) {
   if (r.IsMatch(target)) {
       found = true;
       break; // get out as soon as found
   }
}
if (found) {
    // do stuff
}

The search is currently taking a long time for the large files being processed. Is there a way to make .Any or .First get this done more efficiently?

nimchimpsky
  • Maybe you should parallelize your code (PLINQ)? – JohnyL May 05 '18 at 18:31
  • LINQ won't make your code faster at all; it just hides complexity and introduces something that may lead to much confusion if you're not familiar with it: deferred execution. So in short: don't use LINQ for the sake of optimization. Having said this, your code is as fast as it can be, at least if you really want to use regex. – MakePeaceGreatAgain May 05 '18 at 18:35
  • This is an inefficient use of the regex engine. Combine the two regexes into 1 regex using an alternation. `new Regex("string1|string2");` –  May 05 '18 at 18:36
  • If you have _many strings_ use this tool _[HERE](http://www.regexformat.com/version7_files/Rx5_ScrnSht01.jpg)_ to create a full blown regex _trie_, the fastest search mechanism known to man. And _[HERE](http://www.regexformat.com/scrn8/TernConv.jpg)_ it creates a trie for all emoji. –  May 05 '18 at 18:39
  • Filters.Any(x => x.IsMatch(target)); ?? – Diogo Neves May 05 '18 at 18:45
  • `if (Filters.AsParallel().Any(x => x.IsMatch(target)))` is slower, so PLINQ doesn't help. – nimchimpsky May 06 '18 at 02:15

1 Answer

As hinted in the comments, the easiest simplification using LINQ is All (to require that every pattern matches) or Any (to combine your regex conditions in a || fashion).

List<Regex> Filters = new List<Regex>();
Filters.Add(new Regex("string1", RegexOptions.IgnoreCase | RegexOptions.Compiled));
Filters.Add(new Regex("string2", RegexOptions.Compiled));
string target = "string which may contain string1 or string2 or neither";
if (Filters.Any(x => x.IsMatch(target)))
{
    // do stuff
}

However, if you want to mix All and Any, consider writing your own extension method that combines both, so the input is not evaluated more than once. @jonskeet has a neat example here.

Nonetheless, the biggest gain probably comes from combining and optimizing your regex patterns. Hand-optimized patterns are usually best, but you can try your luck with a Perl module such as the following to get it done automatically:

  • Dan Kogai's Regexp-Optimizer-0.23 to optimize/assemble patterns
use Regexp::Optimizer;
my $re = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
# $re is now qr/foo(?:[bx]ar|zap)/

For demonstration purposes, feeding your sample patterns, joined with an alternation, to the optimizer:
The raw match pattern string1|string2 becomes the optimized match pattern string[12].
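
On the C# side, the combining step might look like the minimal sketch below. It assumes the search terms are plain literal strings (hence Regex.Escape) and that the same options can apply to every term, which differs from the question, where only the first filter is case-insensitive. The CombinedFilter class and its Build method are hypothetical names used for illustration only, and no automatic prefix optimization is attempted here.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class CombinedFilter
{
    // Hypothetical helper: builds ONE regex that matches if any of the
    // given literal search strings occurs in the input.
    // Assumption: all terms share the same options (case-insensitive here).
    // Note: an empty list yields an empty pattern, which matches everything.
    public static Regex Build(IEnumerable<string> literals)
    {
        // Regex.Escape makes each term match literally, so characters
        // such as '.' or '|' inside a term are not treated as syntax.
        string pattern = string.Join("|", literals.Select(s => Regex.Escape(s)));
        return new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
    }
}
```

With this, the loop over Filters collapses to a single call such as `CombinedFilter.Build(new[] { "string1", "string2" }).IsMatch(target)`, and the regex should be built once and reused for every line or file. Whether this is actually faster for your data is worth measuring rather than assuming.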

wp78de