46

Both languages claim to use Perl-style regular expressions. If I have one language test a regular expression for validity, will it work in the other? Where do the regular expression syntaxes differ?

The use case here is a C# (.NET) UI talking to an eventual Java back end implementation that will use the regex to match data.

Note that I only need to worry about matching, not about extracting portions of the matched data.
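For the use case described here, one option is to let the Java side confirm that a pattern at least compiles there. The following is only a sketch: the class and method names are hypothetical, and a successful compile does not prove that the pattern matches the same strings in both engines.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper, not part of any library.
class JavaRegexValidity {
    // Returns true if java.util.regex accepts the pattern syntax.
    static boolean compilesInJava(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Plain syntax shared by both flavors.
        System.out.println(compilesInJava("[a-z]+\\d{2}"));
        // .NET-style conditional, which java.util.regex does not support.
        System.out.println(compilesInJava("(a)?(?(1)b|c)"));
    }
}
```

A check like this only catches syntax that Java rejects outright. Constructs that both engines accept but interpret differently still need test data.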

Rex M
TREE

6 Answers

100

There are quite a lot of differences.

Character Class

  1. Character class subtraction [abc-[cde]]
    • .NET YES (2.0)
    • Java: Emulated via character class intersection and negation: [abc&&[^cde]]
  2. Character class intersection [abc&&[cde]]
    • .NET: Emulated via character class subtraction and negation: [abc-[^cde]]
    • Java YES
  3. \p{Alpha} POSIX character class
    • .NET NO
    • Java YES (US-ASCII)
  4. In (?x) mode (COMMENTS in Java, IgnorePatternWhitespace in .NET), a space (U+0020) inside a character class is significant.
    • .NET YES
    • Java NO
  5. Unicode Category (L, M, N, P, S, Z, C)
    • .NET YES: \p{L} form only
    • Java YES:
      • From Java 5: \pL, \p{L}, \p{IsL}
      • From Java 7: \p{general_category=L}, \p{gc=L}
  6. Unicode Category (Lu, Ll, Lt, ...)
    • .NET YES: \p{Lu} form only
    • Java YES:
      • From Java 5: \p{Lu}, \p{IsLu}
      • From Java 7: \p{general_category=Lu}, \p{gc=Lu}
  7. Unicode Block
    • .NET YES: \p{IsBasicLatin} only. (Supported Named Blocks)
    • Java YES: (name of the block is free-casing)
      • From Java 5: \p{InBasicLatin}
      • From Java 7: \p{block=BasicLatin}, \p{blk=BasicLatin}
  8. Spaces and underscores allowed in all long block names (e.g. BasicLatin can be written as Basic_Latin or Basic Latin)
    • .NET NO
    • Java YES (Java 5)
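A small Java sketch of two items from this list: emulating .NET's class subtraction with intersection and negation, and the ASCII-only default of \p{Alpha} compared with \p{L}. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class.
class CharClassDemo {
    // Java spelling of .NET's [abc-[cde]]: keep a, b, c but drop c, d, e.
    static final Pattern SUBTRACTION = Pattern.compile("[abc&&[^cde]]");

    static boolean inSubtraction(String s) {
        return SUBTRACTION.matcher(s).matches();
    }

    // POSIX class: US-ASCII letters only unless Unicode mode is enabled.
    static boolean isAsciiAlpha(String s) {
        return s.matches("\\p{Alpha}");
    }

    // Unicode general category L (any letter); this form also works in .NET.
    static boolean isUnicodeLetter(String s) {
        return s.matches("\\p{L}");
    }

    public static void main(String[] args) {
        System.out.println(inSubtraction("a"));
        System.out.println(inSubtraction("c"));
        System.out.println(isAsciiAlpha("\u00e9"));
        System.out.println(isUnicodeLetter("\u00e9"));
    }
}
```

If a pattern has to run in both engines, the \p{L} form is the safer spelling, since it is the only one the list above shows as common to both.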

Quantifier

  1. ?+, *+, ++ and {m,n}+ (possessive quantifiers)
    • .NET NO
    • Java YES
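To make the difference concrete, here is a short Java sketch (hypothetical class and method names) that compares a greedy quantifier, a possessive quantifier, and an atomic group. Atomic groups (?>...) are accepted by both Java and .NET, so they can stand in for possessive quantifiers in a shared pattern.

```java
// Hypothetical demo class.
class PossessiveDemo {
    // Greedy: a* gives back one 'a' so the trailing 'a' can match.
    static boolean greedy(String s) {
        return s.matches("a*a");
    }

    // Possessive (Java only): a*+ never gives characters back.
    static boolean possessive(String s) {
        return s.matches("a*+a");
    }

    // Atomic group: same no-backtracking behavior.
    static boolean atomic(String s) {
        return s.matches("(?>a*)a");
    }

    public static void main(String[] args) {
        System.out.println(greedy("aaa"));
        System.out.println(possessive("aaa"));
        System.out.println(atomic("aaa"));
    }
}
```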

Quotation

  1. \Q...\E escapes a string of metacharacters
    • .NET NO
    • Java YES
  2. \Q...\E escapes a string of character class metacharacters (in character sets)
    • .NET NO
    • Java YES
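A brief Java sketch of literal quoting, with hypothetical class and method names. Pattern.quote is the programmatic counterpart of \Q...\E. On the .NET side the usual tool is Regex.Escape, which produces backslash escapes rather than \Q...\E, so a quoted pattern generated in Java should not be passed to .NET unchanged.

```java
import java.util.regex.Pattern;

// Hypothetical demo class.
class QuoteDemo {
    // Inline quoting: everything between \Q and \E is taken literally.
    static boolean matchesInlineQuoted(String input) {
        return input.matches("\\Q1+1=2\\E");
    }

    // Programmatic quoting of an arbitrary literal.
    static boolean matchesLiteral(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    // Unquoted for comparison: '+' acts as a quantifier here.
    static boolean matchesUnquoted(String input) {
        return input.matches("1+1=2");
    }

    public static void main(String[] args) {
        System.out.println(matchesInlineQuoted("1+1=2"));
        System.out.println(matchesLiteral("1+1=2", "1+1=2"));
        System.out.println(matchesUnquoted("1+1=2"));
    }
}
```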

Matching construct

  1. Conditional matching (?(?=regex)then|else), (?(regex)then|else), (?(1)then|else) or (?(group)then|else)
    • .NET YES
    • Java NO
  2. Named capturing group and named backreference
    • .NET YES:
      • Capturing group: (?<name>regex) or (?'name'regex)
      • Backreference: \k<name> or \k'name'
    • Java YES (Java 7):
      • Capturing group: (?<name>regex)
      • Backreference: \k<name>
  3. Multiple capturing groups can have the same name
    • .NET YES
    • Java NO (Java 7)
  4. Balancing group definition (?<name1-name2>regex) or (?'name1-name2'subexpression)
    • .NET YES
    • Java NO
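A minimal Java 7+ sketch of the named-group forms that both engines share, plus a check that Java rejects a repeated group name. The class, method, and group names are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class.
class NamedGroupDemo {
    // (?<name>...) is accepted by both Java 7+ and .NET.
    static final Pattern DATE = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

    static String yearOf(String input) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group("year") : null;
    }

    // \k<name> back reference: matches a doubled lowercase letter.
    static boolean isDoubledLetter(String s) {
        return s.matches("(?<ch>[a-z])\\k<ch>");
    }

    // Returns false when Java refuses a pattern that reuses a group name.
    static boolean duplicateNamesCompile() {
        try {
            Pattern.compile("(?<n>a)|(?<n>b)");
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(yearOf("2009-02"));
        System.out.println(isDoubledLetter("aa"));
        System.out.println(duplicateNamesCompile());
    }
}
```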

Assertions

  1. (?<=text) (positive lookbehind)
    • .NET Variable-width
    • Java Obvious width (the lookbehind needs an obvious maximum length)
  2. (?<!text) (negative lookbehind)
    • .NET Variable-width
    • Java Obvious width (the lookbehind needs an obvious maximum length)
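A lookbehind with a bounded repetition such as {1,3} has an obvious maximum length, so it should be acceptable to both engines. The Java sketch below (hypothetical names) shows that portable form. How Java treats unbounded quantifiers such as + inside a lookbehind varies between Java versions, so the sketch avoids them.

```java
import java.util.regex.Pattern;

// Hypothetical demo class.
class LookbehindDemo {
    // "px" preceded by one to three digits; the lookbehind length is bounded.
    static final Pattern PX_AFTER_DIGITS = Pattern.compile("(?<=\\d{1,3})px");

    static boolean hasPxAfterDigits(String s) {
        return PX_AFTER_DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(hasPxAfterDigits("12px"));
        System.out.println(hasPxAfterDigits("apx"));
    }
}
```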

Mode Options/Flags

  1. ExplicitCapture option (?n)
    • .NET YES
    • Java NO
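Without (?n), the closest Java equivalent is to write non-capturing groups (?:...) by hand and name the groups that should capture. A small sketch with hypothetical names, using groupCount() to show the effect:

```java
import java.util.regex.Pattern;

// Hypothetical demo class.
class ExplicitCaptureDemo {
    // Plain parentheses capture, so this pattern has two groups.
    static int groupsWithPlainParens() {
        return Pattern.compile("(\\d+)-(?<id>\\w+)").matcher("").groupCount();
    }

    // (?:...) groups without capturing, leaving only the named group.
    static int groupsWithNonCapturing() {
        return Pattern.compile("(?:\\d+)-(?<id>\\w+)").matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupsWithPlainParens());
        System.out.println(groupsWithNonCapturing());
    }
}
```

Since the question is only about matching, this difference affects group numbering and nothing else.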

Miscellaneous

  1. (?#comment) inline comments
    • .NET YES
    • Java NO
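Java has no (?#comment), but it does allow # comments when the COMMENTS flag (or an embedded (?x)) is set. A sketch with hypothetical names, assuming a simple phone-like format purely for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical demo class.
class CommentsFlagDemo {
    // In COMMENTS mode, whitespace is ignored and '#' starts a comment
    // that runs to the end of the line.
    static final Pattern NUMBER = Pattern.compile(
            "\\d{3}  # three-digit prefix\n"
          + "-\\d{4} # four-digit line number",
            Pattern.COMMENTS);

    static boolean isNumber(String s) {
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNumber("555-1234"));
        System.out.println(isNumber("5551234"));
    }
}
```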

nhahtdh
Drew Noakes
6

Check out http://www.regular-expressions.info/refflavors.html for plenty of regex info, including a nice chart that details the differences between Java and .NET.

Seth
  • +1 good info. If anyone wants to pull out the high-level data from here (named groups, full string v. partial matches, etc) I'll mark that as the answer. – TREE Feb 12 '09 at 14:31
4

C# regex has its own convention for named groups, (?<name>). I don't know of any other differences.

Rex M
  • are named groups used for matching? or for extracting the matched portions after the match? – TREE Feb 11 '09 at 21:00
2

.NET Regex supports counting, so you can match nested parentheses, which is something you normally cannot do with a regular expression. According to Mastering Regular Expressions, it is one of the few implementations that does this, so that could be a difference.

Brian Rasmussen
2

Java uses standard Perl-type regex as well as POSIX regex. Looking at the C# documentation on regexes, it looks like Java has all of the C# regex syntax, but not the other way around.

Compare them yourself: Java: C#:

EDIT: Currently, no other regex flavor supports Microsoft's version of named capture.

WolfmanDragon
  • 1
    No, .Net has several features Java lacks, as well as vice-versa. In fact, when it comes to cool features, I'd say .Net has a clear lead. But I think they made a big mistake leaving out possessive quantifiers. – Alan Moore Feb 12 '09 at 06:09
1

From my experience:

Java 7 regular expressions as compared to .NET 2.0 regular expressions:

  • Underscore symbol in group names is not supported

  • Groups with the same name (in the same regular expression) are not supported (although it may be really useful in expressions using "or"!)

  • Groups that captured nothing have the value null, not an empty string

  • Group with index 0 also contains the whole match (same as in .NET) BUT is not included in groupCount()

  • A group back reference in a replacement expression is also denoted with a dollar sign (e.g. $1), but if the same expression contains a dollar sign as the end-of-line marker, then the back reference dollar should be escaped (\$); otherwise Java reports the "illegal group reference" error

  • The end-of-line symbol ($) behaves greedily. Consider, for example, the following expression (Java-string is given): "bla(bla(?:$|\r\n))+)?$". Here the last line of text will NOT be captured! To capture it, we must replace "$" with "\z".

  • There is no "Explicit Capture" mode.

  • Empty string doesn't satisfy the ^.{0}$ pattern.

  • The "-" symbol must be escaped when used inside square brackets. That is, the pattern "[a-z+-]+" doesn't match the string "f+g-h" in Java, but it does in .NET. To match in Java, the pattern should look like this (Java-string is given): "[a-z+\-]+".

NOTE: "(Java-string is given)" - just to explain double escapes in the expression.
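A few of the points above can be checked with a short Java sketch. It covers only the null value of unmatched groups, groupCount() excluding group 0, and a literal dollar sign in a replacement string. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
class GroupBehaviourDemo {
    static final Pattern EITHER = Pattern.compile("(a)|(b)");

    // Returns group 1 after a full match; null if that group did not take part.
    static String firstGroup(String input) {
        Matcher m = EITHER.matcher(input);
        return m.matches() ? m.group(1) : "<no match>";
    }

    // Number of capturing groups; group 0 (the whole match) is not counted.
    static int groupCount() {
        return EITHER.matcher("").groupCount();
    }

    // In a replacement string, \$ is a literal dollar and $1 is group 1.
    static String prefixDollar(String input) {
        return input.replaceAll("(\\d+)", "\\$$1");
    }

    public static void main(String[] args) {
        System.out.println(firstGroup("b"));
        System.out.println(groupCount());
        System.out.println(prefixDollar("42"));
    }
}
```

When the replacement text comes from user input, Matcher.quoteReplacement can be used to escape it in one step.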

Alexey Y.