What is PCRE-compatible syntax, and is C# PCRE-compatible? I found this on Wikipedia:

Perl Compatible Regular Expressions (PCRE) is a regular expression C library inspired by the regular expression capabilities in the Perl programming language, written by Philip Hazel, starting in summer 1997. PCRE's syntax is much more powerful and flexible than either of the POSIX regular expression flavors and many classic regular expression libraries. The name is misleading, because PCRE and Perl each have capabilities not shared by the other.

Source

Mohamad Shiralizadeh

1 Answer

C# regexes share much of their syntax with PCRE regexes. Most features overlap, but each library also has features the other lacks.

A few examples:

PCRE

  • Supports recursion
  • Supports backtrack control verbs
  • Supports constructs like (?(DEFINE) ... )
  • Supports more options
  • Offers a DFA parsing mode
  • Supports partial matches
  • Supports \K
  • Supports possessive quantifiers such as X++ (shorthand for the atomic group (?>X+))

.NET

  • Supports capture stacks and duplicate named groups
  • Supports balancing groups
  • Supports variable length lookbehind
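
As a rough illustration of the .NET-specific items above, here is a minimal sketch using only `System.Text.RegularExpressions`. The patterns, the group names (`open`, `digit`) and the sample inputs are made up for this example and are not from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

class DotNetOnlyFeatures
{
    static void Main()
    {
        // Variable-length lookbehind: the lookbehind body contains \s*,
        // so it can span any number of characters.
        var lookbehind = new Regex(@"(?<=key\s*=\s*)\w+");
        Console.WriteLine(lookbehind.Match("key   =  value").Value); // value

        // Balancing groups: push a capture on '(' and pop one on ')',
        // then fail if anything is left on the "open" stack at the end.
        var balanced = new Regex(@"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");
        Console.WriteLine(balanced.IsMatch("(a(b)c)")); // True
        Console.WriteLine(balanced.IsMatch("(a(b)c"));  // False

        // Capture stacks: a group inside a repetition keeps every capture,
        // not only the last one.
        Match repeated = Regex.Match("a1b2c3", @"(?:[a-z](?<digit>\d))+");
        Console.WriteLine(repeated.Groups["digit"].Captures.Count); // 3
    }
}
```

In PCRE, the balanced-parentheses check would typically be written with recursion instead, which is one of the PCRE-side items listed earlier.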

This list is not exhaustive. You can compare both flavours on this page and the sibling pages.

Given the differences, I wanted to be able to use PCRE regexes from .NET and recently started PCRE.NET, which is a wrapper project. It's not finished yet but is starting to be usable.

Lucas Trzesniewski
  • As a long-term user of the old PCRE wrapper for .NET, I was glad to find the one you've been working on (compatible with 4.5.2, which the old one isn't) and doubly glad to find it's equally fast. Three years ago I was doing a lot of regex processing and switched to PCRE when I realised how appallingly slow native C# regex processing is. A recent example: running 100 files through approx. 90 regex rules took 190 s in C# and 4 s in PCRE. – Simon Sep 22 '16 at 15:58
  • @Simon nice, thanks for the feedback! :) It heavily depends on the regexes involved, though. I haven't spent enough time on benchmarking yet, but some patterns are definitely faster in .NET. – Lucas Trzesniewski Sep 22 '16 at 16:02
  • Mine are mainly zero-width lookahead types, e.g. `(?=.*?words)(?=.*?order)(?=.*?in)(?=.*?any)`. The issue (kind of!) I have now is that I'm processing 3K documents already in memory. The old C# code had my CPU at 100%, but with PCRE it won't go over about 25%. There's no other activity (disk etc.), and I'm running the 3K documents in a parallel foreach against the 90 regexes; even adding a load of debug output only pushed the CPU to 45%! I assume I've hit some physical limit on the CPU reading RAM that isn't so easy to monitor. Is it safe to assume there is nothing thread-limiting in PCRE? – Simon Sep 23 '16 at 07:40
  • @Simon nope, PCRE and PCRE.NET are fully thread-safe. Looks like it's time to use a profiler on your code, it's the only way to know where the bottleneck comes from. Though in your case I think you'd be better off avoiding regexes altogether and using `String.Contains` or maybe you should consider something like Lucene.NET. Also, I hope this regex starts with `^` :) – Lucas Trzesniewski Sep 23 '16 at 07:56
  • Thanks for confirming the threading. The profiler is my next task. I don't use `^` because I always run these with `Match`, so I assumed (naively?) that covered the start anchoring. A colleague mentioned `String.Contains`, but given that these regexes rarely register double-digit ElapsedTicks, would four `String.Contains` calls really be faster than, say, the previous regex? Or is it the short-circuiting that helps? Though I wonder whether the regex would do the same if the first word isn't found. – Simon Sep 23 '16 at 08:31
  • @Simon, no, `Match` will try to find a match at any starting position, so you should anchor your pattern either with `^`, `\A` or `PcreOptions.Anchored`. For performance you should also make sure you're using `PcreOptions.Compiled` and `PcreMatchOptions.NoUtfCheck` (don't confuse with `PcreOptions.NoUtfCheck`) if you can ensure your input is a valid Unicode string. And the only way to be sure whether `String.Contains` is faster or not is, well... to measure both approaches with your data ;) – Lucas Trzesniewski Sep 23 '16 at 08:42
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/124016/discussion-between-simon-and-lucas-trzesniewski). – Simon Sep 23 '16 at 09:11
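
The anchoring point raised in the comments can be sketched with the built-in .NET `Regex` class, whose `Match` also scans forward from every starting position unless the pattern is anchored. This is an illustrative sketch only: it does not use PCRE.NET, the sample inputs are invented, and it says nothing about which approach is faster on real data.

```csharp
using System;
using System.Text.RegularExpressions;

class AnchoringDemo
{
    static void Main()
    {
        const string input = "xx abc";

        // Unanchored: Match retries at each starting position until one succeeds.
        Console.WriteLine(Regex.Match(input, @"abc").Index);      // 3

        // Anchored with \A: only position 0 is attempted.
        Console.WriteLine(Regex.Match(input, @"\Aabc").Success);  // False

        // "All words, in any order" written as anchored lookaheads.
        // Singleline lets '.' cross line breaks in multi-line documents.
        var allWords = new Regex(@"\A(?=.*?words)(?=.*?order)", RegexOptions.Singleline);
        Console.WriteLine(allWords.IsMatch("order of words"));    // True
        Console.WriteLine(allWords.IsMatch("words only"));        // False

        // The same test without a regex; && short-circuits on the first miss.
        string text = "order of words";
        bool viaContains = text.Contains("words") && text.Contains("order");
        Console.WriteLine(viaContains);                           // True
    }
}
```

Without the `\A`, a failed lookahead set is retried at every later position in the document before the match is finally reported as unsuccessful, which is why anchoring matters for patterns of this shape.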