I was impressed by the speed of regex in JS.

So I decided to make a small comparison. I ran the following code:

var str = "A regular expression is a pattern that the regular expression engine attempts to match in input text.";

var re = new RegExp("t", "g");

console.time();

for(var i = 0; i < 10e6; i++)
   str.replace(re, "1");

console.timeEnd();

The result: 3888.731ms.

Now in C#:

var stopwatch = new Stopwatch();

var str = "A regular expression is a pattern that the regular expression engine attempts to match in input text.";

var re = new Regex("t", RegexOptions.Compiled);

stopwatch.Start();

for (int i = 0; i < 10e6; i++)
    re.Replace(str, "1");

stopwatch.Stop();

Console.WriteLine(stopwatch.Elapsed.TotalMilliseconds);

Result: 32798.8756ms !!

Next, I compared re.exec(str); with Regex.Match(str, "t");: 1205.791ms vs. 7352.532ms, again in favor of JS.
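
One detail that may matter when reading the exec numbers: the `re` in the JS snippet carries the `g` flag, so each `re.exec(str)` call resumes from `re.lastIndex` instead of from the start of the string, whereas the static `Regex.Match(str, "t")` always starts at the beginning. The two loops are therefore probably not doing identical work. A minimal sketch of that behavior follows; the helper name `collectExecIndexes` and the short sample string are only illustrative and were not part of the original test.

```javascript
// Record the match index returned by successive exec() calls.
// A null entry means exec() found nothing from the current lastIndex.
function collectExecIndexes(re, input, calls) {
  const indexes = [];
  for (let i = 0; i < calls; i++) {
    const m = re.exec(input);
    indexes.push(m === null ? null : m.index);
  }
  return indexes;
}

const sample = "input text"; // 't' occurs at indexes 4, 6 and 9

// With "g": walks through the matches, returns null once, then starts over.
const globalRuns = collectExecIndexes(new RegExp("t", "g"), sample, 5);

// Without "g": every call starts at index 0 and finds the first 't'.
const plainRuns = collectExecIndexes(new RegExp("t"), sample, 5);

console.log(globalRuns); // [ 4, 6, 9, null, 4 ]
console.log(plainRuns);  // [ 4, 4, 4, 4, 4 ]
```

For a like-for-like comparison one could drop the `g` flag for the exec test, or reset `re.lastIndex = 0` before each call.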

Is .NET simply not suitable for massive text processing?

UPDATE 1: the same test with the pattern [ta] (instead of the literal t):

3336.063ms in JS vs. 64534.4766ms (!!!) in C#.

UPDATE 2: another example, this time with the pattern \d+:

console.time();

var str = "A regular expression is a pattern that the regular expression engine attempts 123 to match in input text.";


var re = new RegExp("\\d+", "g");
var result;
for(var i = 0; i < 10e6; i++)
    result = str.replace(re, "$&");
   

console.timeEnd();

3350.230ms in JS vs. 32582.405ms in C#.

dovid
  • Have you tried a precompiled regex in c#? – Sefe Dec 16 '17 at 18:26
  • 1
    I was able to reproduce the c# performance, Release/Any CPU (64 bit)/Not Running in Visual Studio. My time using RegexOptions.None: 46509.2514 ms. My time using RegexOptions.Compiled: 36174.9981 ms. – dbc Dec 16 '17 at 18:33
  • Assign str.replace(re, "1"); to something to ensure JS is not considering it a no-op and optimizing it away – Alex K. Dec 16 '17 at 18:35
  • @AlexK. `result = str.replace(str, "1");` = 3026.953ms – dovid Dec 16 '17 at 18:37
  • Curious to know what is your CPU model? – revo Dec 16 '17 at 19:05
  • @revo i3 4160. the C# code run from linqpad x86. the js by node. – dovid Dec 16 '17 at 19:11
  • Running on a i7 7700hq environment js version takes 8509ms. Weird. – revo Dec 16 '17 at 19:15
  • Could you test it again but this time with a pattern that isn't a literal character, for example a simple character class: `[ta]`. – Casimir et Hippolyte Dec 16 '17 at 19:54
  • @CasimiretHippolyte see update – dovid Dec 16 '17 at 20:11
  • 2
    Ok, however, I don't see why the update 2 test is "more useful". As an aside, when you write `\d+` in a double quoted string for the `RegExp` constructor, it is interpreted as `d+` *(the nonsense escape is simply ignored and the next character is seen as a literal)*. To get the `\d` character class inside a double quoted string you have to use two backslashes: `var re=RegExp("\\d+", "g");`. Note that writing `var re=/\d+/g;` or `var re=RegExp(/\d+/g);` is exactly the same *(none of these versions is compiled earlier or later.)* – Casimir et Hippolyte Dec 16 '17 at 20:40
  • Did you try to use `regex.Matches()` and build the result with a `StringBuilder` instance to see if it changes anything? – Casimir et Hippolyte Dec 16 '17 at 20:47
  • I recommend comparing in IE too if you are comparing only Chrome – Slai Dec 16 '17 at 22:41
  • see also https://github.com/mariomka/regex-benchmark and https://github.com/dotnet/corefx/issues/24333 – dovid Dec 17 '17 at 11:11
  • https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7/ – dovid May 15 '22 at 06:19
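
The escaping point raised in the comments above is easy to check directly. The snippet below is only a small illustration (the variable names and the shortened input are made up for it): it builds the pattern three ways and inspects `source` to see what the engine actually received.

```javascript
// "\d" inside an ordinary string literal is just "d", so this pattern is d+.
const single = new RegExp("\d+", "g");

// Two backslashes in the string yield one backslash in the pattern: \d+.
const doubled = new RegExp("\\d+", "g");

// A regex literal needs no extra escaping.
const literal = /\d+/g;

console.log(single.source);  // d+
console.log(doubled.source); // \d+
console.log(literal.source); // \d+

const input = "attempts 123 to match";
console.log(input.replace(single, "#"));  // attempts 123 to match (no 'd' to replace)
console.log(input.replace(doubled, "#")); // attempts # to match
```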

2 Answers

3

String in C# is a dangerous beast and you really can shoot yourself in the foot if you use it carelessly, but I don't think the given test is representative enough to warrant any generalizations.

First, I did reproduce similar performance for your test case. Adding RegexOptions.Compiled reduced the required time to 30-ish seconds, but this is still a significant difference.

This specific test case is probably not very realistic: who would use a regex for a single-character replace? If you use the dedicated API for this task, you get comparable results: str.Replace('t', '1'); took 1600ms on my machine.

This means that for this specific task C# performance is comparable to JS. Whether the C# Regex.Replace() is internally ill-suited to single-char replaces, or whether the JS version is optimizing the regex away, is something a JS guru should answer.
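
To illustrate the "dedicated API" idea without switching languages, here is a rough JavaScript analogue. Note the assumptions: the timing above was for C#'s `str.Replace('t', '1')`; the split/join approach and both function names below are stand-ins chosen for this sketch and were not measured here.

```javascript
// Regex-based replacement, as in the question.
function replaceWithRegex(s) {
  return s.replace(/t/g, "1");
}

// Regex-free replacement: split on the literal character and re-join.
function replaceWithoutRegex(s) {
  return s.split("t").join("1");
}

const text = "A regular expression is a pattern.";

console.log(replaceWithRegex(text));    // A regular expression is a pa11ern.
console.log(replaceWithoutRegex(text)); // A regular expression is a pa11ern.
```

Both produce the same string, so timing the two in the same loop would show how much of the cost comes from the regex machinery rather than from building the result.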

It would be interesting to know whether a more realistic, complex regex shows a notable difference.

Edit: I verified that the performance gap remains when the replace results are actually used and when the input string differs on each run (10s vs 35s in my tests). So the gap is smaller, but still there.
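
A benchmark of that shape could look roughly like the sketch below, shown in JavaScript. This is not the exact code behind the numbers above: the function name `benchReplace`, the way the input is varied (appending the loop counter) and the way the result is consumed (summing lengths) are all assumptions made for illustration. It relies on the global `performance.now()`, which is available in browsers and recent Node versions.

```javascript
// Time `iterations` replace calls where every input differs and every
// result is consumed, so the engine cannot treat the work as dead code.
function benchReplace(re, base, iterations) {
  let totalLength = 0;
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const input = base + i;                       // different input each run
    totalLength += input.replace(re, "1").length; // result is actually used
  }
  const elapsedMs = performance.now() - start;
  return { elapsedMs, totalLength };
}

const base = "A regular expression is a pattern that the regular expression engine attempts to match in input text.";
const run = benchReplace(/t/g, base, 1e5);
console.log(run.elapsedMs.toFixed(3) + "ms", run.totalLength);
```

Summing the lengths is just a cheap way of using each result; printing `totalLength` at the end keeps the sum itself observable.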

Possible reasons

According to hints in this SO question, browser implementations delegate some string operations to optimized C++ code. If they do this for string concatenation, they probably do it for regex as well. AFAIK, the C# Regex and String classes stay in the managed world, and that brings some baggage.

Imre Pühvel
  • Add a number to the string, and change the expression to `\d+`. I think this is a useful classic case. The results are similar (4 vs 31 seconds). – dovid Dec 16 '17 at 20:01
1

One of the reasons for the big difference between JS regex and .NET regex is that JS lacks quite a number of advanced features, whereas .NET is very feature-rich.

Here are two quotes from regular-expressions.info:

JavaScript:

JavaScript implements Perl-style regular expressions. However, it lacks quite a number of advanced features available in Perl and other modern regular expression flavors:

No \A or \Z anchors to match the start or end of the string. Use a caret or dollar instead.

No atomic grouping or possessive quantifiers.

No Unicode support, except for matching single characters with \uFFFF.

No named capturing groups. Use numbered capturing groups instead.

No mode modifiers to set matching options within the regular expression.

No conditionals.

No regular expression comments. Describe your regular expression with JavaScript // comments instead, outside the regular expression string.

.NET Framework:

The Microsoft .NET Framework, which you can use with any .NET programming language such as C# (C sharp) or Visual Basic.NET, has solid support for regular expressions. .NET's regex flavor is very feature-rich. The only noteworthy feature that's lacking are possessive quantifiers.
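
For the JavaScript side, the workarounds named in the first quote (caret and dollar instead of \A and \Z, numbered groups instead of named ones, and // comments outside the pattern) might look like this in practice. The date pattern and the `parseDate` helper are only an example picked for this sketch.

```javascript
// Matches a date that occupies the whole string.
// ^ and $ stand in for \A and \Z (no "m" flag, so they anchor to the string).
// Group 1: year, group 2: month, group 3: day.
const dateRe = /^(\d{4})-(\d{2})-(\d{2})$/;

// Return the numbered groups under readable names, or null if no match.
function parseDate(s) {
  const m = dateRe.exec(s);
  if (m === null) return null;
  return { year: m[1], month: m[2], day: m[3] };
}

console.log(parseDate("2017-12-16"));  // { year: '2017', month: '12', day: '16' }
console.log(parseDate("x2017-12-16")); // null
```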

codeDom