UB: C#'s Regex.Match returns whole string instead of part when matching

Question

Attention! This is NOT related to the question "Regex problem, matches the whole string instead of a part".

Hi all. I am trying to run:

Match y = Regex.Match(someHebrewContainingLine, @"^.{0,9} - \[(.*)?\s\d{1,3}");

Aside from the other VS Hebrew quirks (how do you like ] being replaced by [ when editing the string?), it occasionally returns crazy results:

Match.Captures.Count = 1;
Match.Captures[0] = whole string! (not expected)
Match.Groups.Count = 2; (not expected)
Match.Groups[0] = whole string again! (not expected)
Match.Groups[1] = (.*)? value (expected).

Regex.Matches() acts the same way.

What can be a general reason for such behaviour? Note: it doesn't act this way on simple test strings like Regex.Match("-היי45--", "-(.{1,5})-") (the sample is displayed incorrectly; please look at the page's source code), so there must be something in the regex that makes it greedy. The matched string contains [ .... ], but simply adding them to the test string doesn't cause the same effect.

Have you tried using `RegexOptions.RightToLeft`, such as `Regex.Match(input, regex, RegexOptions.RightToLeft);`? — newfurniturey, Aug 10 '12 at 19:34
I noticed that you're not using the end of string signal `$`. Maybe it will help =) — Andre Calil, Aug 10 '12 at 19:35
@newfurniturey - [Jeffrey Friedl](http://regex.info/), who authored "Mastering Regular Expressions", claims that this option is buggy in .NET. — Oded, Aug 10 '12 at 19:36
@Oded I've only used it with Arabic text (once) and it worked fine - but I'll take your word for it as I can't prove otherwise! — newfurniturey, Aug 10 '12 at 19:41
@newfurniturey, it seems that option is ignored in this case. I really want to sink this RTL support and write my own class for handling RTL strings; it's always a problem when they show up in this project. — kagali-san, Aug 10 '12 at 19:42
@AndreCalil, almost the same thing as with .RightToLeft.. would probably upload the sample to let you get some fun with it. — kagali-san, Aug 10 '12 at 19:42
Whoa. After a close inspection of the test string, I found that both [ and ] have the same character code - 91. More special effects to come.. — kagali-san, Aug 10 '12 at 20:03
OK, at last: the effect of [ and ] having the same character code was due to a bad test string. The matching effect itself will now be described in an answer. — kagali-san, Aug 10 '12 at 21:04

Kent · Answer 1 · 2012-08-10T19:51:39.783

6

I hit this problem when I first started using the .NET regex classes, too. The way to understand it is that the Groups member of Match is the nesting member: you have to traverse Groups to get down to the individual captures, and each Group in turn has a Captures member. The Match itself is like the top-level "Group" (Groups[0]), in that it represents the successful match of your whole expression against the input, so its Value is everything the expression matched and not just the parenthesized part. A single input string can contain multiple matches. The Captures member of the Match holds the text matched by your full expression.

Whenever you have a single capture group, as you do, Groups[1] will always be the data you are interested in. Look at this page. The source code in examples 2 and 3 is hardcoded to print out Groups[1].

Remember that a single capture group can capture multiple substrings in a single match operation, for example when it sits inside a repetition. In that case you would see Match.Groups[1].Captures.Count be greater than 1. Passing multiple matching lines of text to a single Match call does not raise Match.Captures.Count, though: every further match is a separate Match object (obtained from Regex.Matches or Match.NextMatch), and each top-level Match.Captures holds the full string matched by your full expression.
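To make the Match / Groups / Captures nesting concrete, here is a minimal sketch. The class name `CaptureHierarchyDemo`, the helper `CapturesOfGroup1`, and the sample input and pattern are made up for illustration; only `Regex.Match`, `Match.Groups`, and `Group.Captures` come from .NET.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the question's project.
public static class CaptureHierarchyDemo
{
    // Returns every substring captured by group 1 during a single match.
    public static string[] CapturesOfGroup1(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success) return new string[0];

        CaptureCollection captures = m.Groups[1].Captures;
        string[] result = new string[captures.Count];
        for (int i = 0; i < captures.Count; i++)
            result[i] = captures[i].Value;
        return result;
    }

    public static void Main()
    {
        // Group 1 sits inside a repetition, so it captures once per iteration.
        Match m = Regex.Match("ab12cd34ef56", @"(?:[a-z]+(\d+))+");

        Console.WriteLine(m.Value);                     // ab12cd34ef56 (whole match)
        Console.WriteLine(m.Captures.Count);            // 1 (the match itself)
        Console.WriteLine(m.Groups[1].Value);           // 56 (last capture of group 1)
        Console.WriteLine(m.Groups[1].Captures.Count);  // 3 (12, 34, 56)
    }
}
```

Note that `Groups[1].Value` only exposes the last capture; the earlier ones are reachable only through `Groups[1].Captures`.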

edited Aug 10 '12 at 19:51

answered Aug 10 '12 at 19:46

Kent


That sounds correct, but then, with a single capture group, defined in various regexes applied against various strings, I expected Match.Value (top) to represent what was inside the capture group and not the whole string, and here I got some undefined behaviour. – kagali-san Aug 10 '12 at 20:06
@Kent, to disappoint you a bit: top Match and top Group are acting differently when using lookaheads. Will write a detailed explanation later. It's an undocumented yet not undefined behaviour. – kagali-san Aug 10 '12 at 21:06
@kagali-san To help understand why the behavior is the way it is in the case of your first comment, imagine the results if your expression had no captures. You are calling a method named Match. And it returns an object named Match. You can have an expression with no captures that does successfully match. Thus, the observed behavior. – Kent Aug 11 '12 at 05:56

score 5 · Answer 2 · answered Aug 10 '12 at 19:46


There is one capture group in the pattern; that is group 1.

There is always group 0, which is the entire match.

Therefore there are a total of 2 groups.
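The group counting above can be checked with a short sketch. It uses an ASCII variant of the question's simple test string (an assumption made here to sidestep the RTL display problem); the class `GroupZeroDemo` and its `Describe` helper are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for illustration only.
public static class GroupZeroDemo
{
    // Returns { Groups.Count, Groups[0].Value, Groups[1].Value } for the first match.
    public static string[] Describe(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return new string[]
        {
            m.Groups.Count.ToString(),
            m.Groups[0].Value,
            m.Groups[1].Value
        };
    }

    public static void Main()
    {
        string[] d = Describe("-abc45--", "-(.{1,5})-");
        Console.WriteLine(d[0]); // 2       (group 0 plus the one explicit group)
        Console.WriteLine(d[1]); // -abc45- (group 0: the entire match)
        Console.WriteLine(d[2]); // abc45   (group 1: the parenthesized part)
    }
}
```

So `Match.Value` and `Groups[0].Value` are the same text, and the part inside the parentheses always has to be read from `Groups[1]`.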

answered Aug 10 '12 at 19:46

MRAB


kagali-san · Accepted Answer · 2012-09-02T17:59:26.057

My test regex was different from all the others in the project's scope (that's what happens when a Perl guy comes to C#), as it had no lookaheads/lookbehinds. So this discovery took some time.

Now, here is why we should call this Regex behaviour undocumented, not undefined.

Let's run some matches against "1.234567890":

PCRE-like syntax: (.)\.2345678
lookahead syntax: (.)(?=\.\d)

When you do a normal match, the result is copied from the whole matched part of the line, no matter where you've put the parentheses; when lookaheads are present, only the text that does not belong to them is copied, because a lookahead checks the text without consuming it.

So, the matches will return:

PCRE: 1.2345678 (at 23:00, this looks like the original string, and that's when I start yelling here at SO)
lookahead: 1
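The two results can be reproduced with a small sketch. The class `LookaheadDemo` and its helpers are hypothetical names introduced here; the input and both patterns are the ones from this answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class for illustration only.
public static class LookaheadDemo
{
    // Text of the whole match (same as Groups[0].Value).
    public static string WholeMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    // Text captured by the first explicit group.
    public static string FirstGroup(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        const string input = "1.234567890";

        // Consuming pattern: everything it matched ends up in Match.Value.
        Console.WriteLine(WholeMatch(input, @"(.)\.2345678")); // 1.2345678
        Console.WriteLine(FirstGroup(input, @"(.)\.2345678")); // 1

        // Lookahead: the text after the group is only checked, not consumed.
        Console.WriteLine(WholeMatch(input, @"(.)(?=\.\d)"));  // 1
        Console.WriteLine(FirstGroup(input, @"(.)(?=\.\d)"));  // 1
    }
}
```

In both cases `Groups[1]` holds the same text; only `Match.Value` differs, which is why regexes built around lookaheads can appear to "return the group" while a plain pattern returns the whole matched span.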
