Faster replacement for Regex

Question

I have a class with around 100 Regex calls. Each call handles a different type of data in a text protocol. I have many files to process, and profiling shows that the regexes account for 88% of my code's execution time.

Most of the code looks like this:

{
  Match m_said = Regex.Match(line, @"(.*) said,", RegexOptions.IgnoreCase);
  if (m_said.Success)
  {
    string playername = m_said.Groups[1].Value;
    // some action
    return true;
  }
}

{
  Match ma = Regex.Match(line, @"(.*) is connected", RegexOptions.IgnoreCase);
  if (ma.Success)
  {
    string playername = ma.Groups[1].Value;
    // some action
    return true;
  }
}
{
  Match ma = Regex.Match(line, @"(.*): brings in for (.*)", RegexOptions.IgnoreCase);
  if (ma.Success)
  {
    string playername = ma.Groups[1].Value;
    long amount = Detect_Value(ma.Groups[2].Value, line);
    // some action
    return true;
  }
}

Is there any way to replace Regex with a faster solution?

I think it depends on the kind of regex you are using... provide some samples! — Marcelo Oliveira, Jan 20 '12 at 12:24

Seki · Accepted Answer · 2012-01-20T14:48:18.953

8

For regexps that are tested in a loop, it is often faster to precompile them outside of the loop and only test them inside it.

Declare the different regexps first, each with its own pattern, and call Match() on them with the text to test only as a second step.
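A minimal sketch of that idea, using two of the patterns from the question. The class and method names (`LineParser`, `TryParse`) are made up for illustration; only the patterns and `RegexOptions.IgnoreCase` come from the original code. It may also help to know that the static `Regex.Match()` overloads cache only a limited number of patterns (`Regex.CacheSize`, 15 by default), so with about 100 distinct patterns the cache is likely to be evicted constantly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineParser
{
    // Constructed once, when the class is first used, then reused for every line.
    private static readonly Regex SaidRegex =
        new Regex(@"(.*) said,", RegexOptions.IgnoreCase);
    private static readonly Regex ConnectedRegex =
        new Regex(@"(.*) is connected", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns true and sets playerName when a pattern matches.
    public static bool TryParse(string line, out string playerName)
    {
        Match m = SaidRegex.Match(line);
        if (m.Success)
        {
            playerName = m.Groups[1].Value;
            return true;
        }

        m = ConnectedRegex.Match(line);
        if (m.Success)
        {
            playerName = m.Groups[1].Value;
            return true;
        }

        playerName = null;
        return false;
    }
}
```

Inside the file-reading loop only `LineParser.TryParse(line, out name)` is called, so no pattern is parsed more than once.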

edited Jan 20 '12 at 14:48

answered Jan 20 '12 at 12:36


Doesn't the RegEx class have a cache? – H H Jan 20 '12 at 12:59
@HenkHolterman: Indeed. I have just checked that in the doc. (I am more accustomed to PCRE, which does not provide a cache mechanism.) The cache should work for the static `Regex.Match()` calls made by the OP. Or are there too many regexps involved, so that `Regex.CacheSize` is a path to explore to improve the performance (but I doubt it)? – Seki Jan 20 '12 at 13:17
@Svisstack: You are welcome ^_^ - Just for my own archives, what did the trick ? Allocating the Regex outside of the loop and / or tweaking Regex.CacheSize ? – Seki Jan 23 '12 at 09:54
@HenkHolterman see: http://stackoverflow.com/questions/513412/how-does-regexoptions-compiled-work/7707369#7707369 – Sam Saffron Feb 10 '12 at 04:39

Tim Pietzcker · Answer 2 · 2012-01-20T15:41:09.087

Aside from precompiling your regex, you could gain (probably much more) performance benefits by writing a more precise regex. In this respect, `.*` is almost always a bad choice:

`(.*) is connected` means: First match the entire string (that's the `.*` part), then backtrack one character at a time until it's possible to match ` is connected`.

Now unless the string is very short or `is connected` appears very close to the end of the string, that's a lot of backtracking which costs time.

So if you can refine what an allowed match is, you can improve performance.

For example, if only alphanumeric characters are allowed, then `(\w+) is connected` will be good. If it's any kind of non-whitespace characters, then use `(\S+) is connected`. Etc., depending on the rules for a valid match.
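As a rough sketch of a tighter pattern, assuming (this is not stated in the question) that player names contain no whitespace and start the line. The `^` anchor and the names `ConnectedParser` and `PlayerName` are additions for illustration only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ConnectedParser
{
    // ^ pins the match to the start of the line, and \S+ cannot run past the
    // first whitespace, so the engine does not swallow the whole line and back up.
    private static readonly Regex Connected =
        new Regex(@"^(\S+) is connected", RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the player name, or null when the line does not match.
    public static string PlayerName(string line)
    {
        Match m = Connected.Match(line);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

A name containing a space, such as "Bob Smith is connected", would no longer match with this pattern, so the character class has to follow the real rules of the protocol.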

In your concrete example, you don't appear to be doing anything with the captured match, so you could even drop regex altogether and just look for a fixed substring. Which method will be the fastest in the end depends a lot on your actual input and requirements.
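If a fixed substring is enough, something along these lines might replace the regex entirely. This is only a sketch: `SubstringParser` and `TryGetConnectedPlayer` are hypothetical names, and `StringComparison.OrdinalIgnoreCase` is assumed to be an acceptable stand-in for `RegexOptions.IgnoreCase` on this input.

```csharp
using System;

public static class SubstringParser
{
    private const string Marker = " is connected";

    public static bool TryGetConnectedPlayer(string line, out string playerName)
    {
        // LastIndexOf mirrors the greedy (.*), which ends at the last occurrence of the marker.
        int i = line.LastIndexOf(Marker, StringComparison.OrdinalIgnoreCase);
        if (i < 0)
        {
            playerName = null;
            return false;
        }

        // Everything before the marker plays the role of capture group 1.
        playerName = line.Substring(0, i);
        return true;
    }
}
```

Whether this beats a well-written regex still depends on the input, so it is worth measuring both on real data.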

Gah! I was about to write a comment about one of the great strengths of regexps being that they could be compiled to DFAs, no backtracking required. But then I looked at the docs for the .NET implementation of regexps and [*you're right!*](http://msdn.microsoft.com/en-us/library/dsy130b4.aspx) — Ryan Culpepper, Jan 20 '12 at 16:58

score 2 · Answer 3 · answered Jan 20 '12 at 16:28

I don't know if you can re-use the expressions, or if the method is called multiple times, but if so you should precompile your regular expressions. Try this:

private static readonly Regex xmlRegex = new Regex("YOUR EXPRESSION", RegexOptions.Compiled);

In your sample, each time the method is used it 'compiles' the expression, but this is unnecessary as the expression is a constant. With the static field above, it is compiled only once. The disadvantage is that the first time you access the expression, it is a bit slower.
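Applied to one of the patterns from the question, that could look roughly like this. The holder class `CompiledPatterns` and the field name `BringsIn` are invented for the example; `RegexOptions.IgnoreCase` is kept because the original calls use it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CompiledPatterns
{
    // RegexOptions.Compiled generates code for the pattern once, at construction.
    // That makes construction slower but matching typically faster.
    public static readonly Regex BringsIn = new Regex(
        @"(.*): brings in for (.*)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);
}
```

A call site then becomes `Match ma = CompiledPatterns.BringsIn.Match(line);` and the rest of the block from the question stays unchanged.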

score 1 · Answer 4 · answered Jan 20 '12 at 12:44

You could try compiling the Regex beforehand or consider combining all the individual Regex expressions into one (monster) Regex:

Match m_said = Regex.Match(line,
            @"(.*) (said|(is connected)|...|...),",
            RegexOptions.IgnoreCase);

You can then test the second capturing group to determine which type of match occurred.
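A possible variation on this idea, restricted to the three patterns shown in the question. It uses named groups instead of numbered ones to tell the alternatives apart; the group names and the `CombinedParser` class are assumptions made for the sketch, and a line that fits several patterns may be classified differently than by the original sequence of separate checks.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedParser
{
    // One pattern, one scan of the line; each alternative has its own named group.
    private static readonly Regex Combined = new Regex(
        @"(?<player>.*)(?: (?<said>said,)| (?<connected>is connected)|(?<brings>: brings in for (?<amount>.*)))",
        RegexOptions.IgnoreCase);

    // Returns "said", "connected", "brings", or null when nothing matched.
    public static string Classify(string line, out string player, out string amount)
    {
        player = null;
        amount = null;

        Match m = Combined.Match(line);
        if (!m.Success) return null;

        player = m.Groups["player"].Value;
        if (m.Groups["said"].Success) return "said";
        if (m.Groups["connected"].Success) return "connected";

        // Only the "brings in" alternative is left; it also captures the amount text.
        amount = m.Groups["amount"].Value;
        return "brings";
    }
}
```

The caller can switch on the returned kind to run the matching action.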

adelphus

This is the only answer I see that attempts to scan the file(s) only once. – H H Jan 20 '12 at 12:55
Depending on the possible texts, there might be a problem with the @"(.*) said" version altogether: What if `User17 said "I have already said that!"` - the regex will find the wrong "said" – Eugen Rieck Jan 20 '12 at 13:13
I can't do that; the lines come in various formats, not only, for example, X command Y. – Svisstack Jan 20 '12 at 13:29
@EugenRieck The OP question (and my answer) was to improve the speed of the code he provided whilst retaining identical behaviour. Trying to read his mind about what he *actually* meant to do is fraught with problems. – adelphus Jan 20 '12 at 13:50
@Svisstack Does it matter? Simply extend the pattern to cover all the possibilities: @"(.*) (said,|(is connected)|(\: brings in for (.*)))". I suspect your code is slow because every Regex.Match has to scan the string individually. It would be worth the time to form one (or a minimal) set of regex's that cover the possibilities. – adelphus Jan 20 '12 at 13:56
But I have other patterns too; these are only 3 out of 100. There are patterns like X Y text Z (Z/Y) ble ble [X] [Y], etc. – Svisstack Jan 20 '12 at 14:14
With that I can group only a few regexes, at the cost of a more complicated regex. That may bring no performance gain but a slowdown: a regex that is X times more complicated will probably run X times more slowly. – Svisstack Jan 20 '12 at 14:15
@Svisstack "probably will work X more slowly". Um, have you tried it? Regex engines are pretty smart creatures. If every one of your regex's has a fixed part (said, is connected, etc) it should run fairly fast. As I said, I suspect the long time is due to the repeated string scanning. Just a suggestion, but the other solutions still require repeated scanning. Your call. – adelphus Jan 20 '12 at 14:22

score 1 · Answer 5 · answered Jan 20 '12 at 12:50

I know Regex can do a lot of things but here is a benchmark with Regex vs char.Split vs string.split

http://www.dotnetperls.com/split in the Benchmarks section
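For a line format with a fixed separator, a Split-based parse might look like the sketch below. It is not equivalent to the regex from the question: `string.Split` is case-sensitive and cuts at the first occurrence of the separator, while the greedy `(.*)` ends at the last one. `SplitParser` and `TryParseBringsIn` are hypothetical names.

```csharp
using System;

public static class SplitParser
{
    private static readonly string[] Separator = { ": brings in for " };

    public static bool TryParseBringsIn(string line, out string player, out string amount)
    {
        // At most 2 parts: the text before the first separator and everything after it.
        string[] parts = line.Split(Separator, 2, StringSplitOptions.None);
        if (parts.Length != 2)
        {
            player = null;
            amount = null;
            return false;
        }

        player = parts[0];
        amount = parts[1];
        return true;
    }
}
```

The amount is returned as text here; in the question's code it would then be passed to `Detect_Value`.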

Guillaume Slashy

I think that site just raped my eyeballs. – adelphus Jan 20 '12 at 13:01
