Regular expression "[ ]" works to weed out white space, but why and how?

Question

To extract the text between the pattern >>Digit<<, I have successfully used the regex "(?<=\>>[0-9]+?<<)[ ].+?(?=\>>[0-9]+?<<)". The regex option is set to Singleline because the text to be extracted may span multiple lines.

>>1<< First Option For Third Variable Reply1 >>1<<

>>2<< Second Option For Third Variable Reply 1 >>2<<

>>3<< Third Option For Third Variable Reply 1 
>>3<<

If I remove the [ ] portion, giving the regex "(?<=\>>[0-9]+?<<).+?(?=\>>[0-9]+?<<)", the matches also include white space (e.g. between >>1<< and >>2<<), which is not my intent. I don't understand why adding [ ] excludes that white space.

I understand that square brackets in regex generally signify character classes, i.e. sets of characters to be included. But here, by inserting square brackets containing a space, I manage to exclude the white space (e.g. between >>1<< and >>2<<). So I am trying to understand how it worked in my case.

Thank you.

Don't ever use regular expressions, no matter how and why. Always write your own algo for parsing text. — , Nov 29 '17 at 13:07
@AlexDepler Thanks for your comment. I have read this and it is very illuminating. I will think of writing my own algos to parse text when I have some time. https://stackoverflow.com/questions/7553722/when-should-i-not-use-regular-expressions — B T S T, Nov 29 '17 at 13:53
I'd like to put it a lot simpler (than Wiktor's answer) by simply saying - your `[ ]` (which b.t.w. is exactly the same as a space without the brackets) ensures the *tag* (`>>n<<`) is followed by a space, and not by just any character (`.`), which would also match the newline character after a terminating *tag*. You could easily solve it by using a capture group to extract the text and *consume* the *tags* (by **not** having them as look-arounds). [Illustrated here at regex101](https://regex101.com/r/BkoTrD/1). — SamWhan, Nov 29 '17 at 14:07
Thank you for directly answering the question. It sure clears up what I am trying to understand. — B T S T, Nov 29 '17 at 15:48
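
The capture-group approach suggested in the comment above could be sketched roughly as follows. The pattern behind the regex101 link is not reproduced here, so the pattern `>>[0-9]+<<(.+?)>>[0-9]+<<`, the variable names, and the `Trim()` call are assumptions made for illustration. Because both tags are consumed by the match itself, the next match attempt can only begin at the next opening tag, and the blank lines between tag pairs are never captured.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
var sample = ">>1<< First Option For Third Variable Reply1 >>1<<\n\n>>2<< Second Option For Third Variable Reply 1 >>2<<\n\n>>3<< Third Option For Third Variable Reply 1 \n>>3<<";

// Assumed pattern: the opening and closing tags are part of the match,
// and only group 1 holds the text between them.
var tagged = @">>[0-9]+<<(.+?)>>[0-9]+<<";

var texts = Regex.Matches(sample, tagged, RegexOptions.Singleline) // Singleline: . also matches \n
                 .Cast<Match>()
                 .Select(m => m.Groups[1].Value.Trim())            // Trim() is an addition for readability
                 .ToList();

foreach (var t in texts)
    Console.WriteLine(t);
```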

Wiktor Stribiżew · Answer 1 · 2017-11-29T12:58:14.717

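The point is that there is whitespace (line breaks) between a closing tag and the next opening tag, e.g. between >>2<< and >>3<<, and it is matched with .+? when the singleline mode is on. The lookbehind is satisfied right after a closing tag too, and the lookahead is satisfied right before the next opening tag. With [ ] in place, the character right after the tag must be a space, and after a closing tag it is a newline, so no match can start there.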
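
To make this visible, here is a minimal sketch that runs both patterns from the question over the sample text and counts the matches that contain nothing but whitespace. The helper `Extract` and the variable names are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question: a blank line separates each closing tag
// from the next opening tag.
var input = ">>1<< First Option For Third Variable Reply1 >>1<<\n\n>>2<< Second Option For Third Variable Reply 1 >>2<<\n\n>>3<< Third Option For Third Variable Reply 1 \n>>3<<";

// The two patterns from the question, without and with the literal space.
var withoutSpace = @"(?<=\>>[0-9]+?<<).+?(?=\>>[0-9]+?<<)";
var withSpace = @"(?<=\>>[0-9]+?<<)[ ].+?(?=\>>[0-9]+?<<)";

// Hypothetical helper: returns all match values for a pattern in Singleline mode.
string[] Extract(string pattern) =>
    Regex.Matches(input, pattern, RegexOptions.Singleline)
         .Cast<Match>()
         .Select(m => m.Value)
         .ToArray();

var loose = Extract(withoutSpace);
var strict = Extract(withSpace);

// Number of matches that are whitespace only.
Console.WriteLine(loose.Count(m => string.IsNullOrWhiteSpace(m)));  // 2 (the two "\n\n" gaps)
Console.WriteLine(strict.Count(m => string.IsNullOrWhiteSpace(m))); // 0
```

Without the space the pattern yields five matches, two of which are the blank lines between tag pairs. With the space it yields only the three intended ones.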

You may try to use a capturing group around the first digit pattern and use a backreference to match the same number on the right:

(?<=>>([0-9]+)<<).*?(?=>>\1<<)

See the regex demo

Details

(?<=>>([0-9]+)<<) - a positive lookbehind making sure there is >>, 1+ digits (Group 1), << immediately to the left of the current location
.*? - any 0+ chars, as few as possible
(?=>>\1<<) - a positive lookahead making sure there is >>, same number as in Group 1, << immediately to the right of the current location.

See the C# demo:

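using System;
using System.Linq;
using System.Text.RegularExpressions;

var s = ">>1<< First Option For Third Variable Reply1 >>1<<\n\n>>2<< Second Option For Third Variable Reply 1 >>2<<\n\n>>3<< Third Option For Third Variable Reply 1 \n>>3<<";
// Group 1 captures the number in the opening tag; \1 requires the same number in the closing tag
var rx = @"(?<=>>([0-9]+)<<).*?(?=>>\1<<)";
var results = Regex.Matches(s, rx, RegexOptions.Singleline) // Singleline: . also matches \n
            .Cast<Match>()
            .Select(m => m.Value);
Console.WriteLine(string.Join("\n", results));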

Result:

 First Option For Third Variable Reply1 
 Second Option For Third Variable Reply 1 
 Third Option For Third Variable Reply 1

Another idea is to reject matches that consist only of whitespace between the >>...<< patterns:

(?<=>>[0-9]+<<)(?!\s+>>[0-9]+<<).*?(?=>>[0-9]+<<)
               ^^^^^^^^^^^^^^^^^

See this regex demo
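
A minimal C# sketch of this second pattern, assuming the same sample text as in the question. The variable names and the `Trim()` call are additions for illustration. The negative lookahead rejects any start position where only whitespace separates the current tag from the next one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input from the question.
var text = ">>1<< First Option For Third Variable Reply1 >>1<<\n\n>>2<< Second Option For Third Variable Reply 1 >>2<<\n\n>>3<< Third Option For Third Variable Reply 1 \n>>3<<";

// (?!\s+>>[0-9]+<<) fails right after a closing tag, where only line breaks
// precede the next opening tag, so those gaps are never matched.
var pattern = @"(?<=>>[0-9]+<<)(?!\s+>>[0-9]+<<).*?(?=>>[0-9]+<<)";

var matches = Regex.Matches(text, pattern, RegexOptions.Singleline) // Singleline: . also matches \n
                   .Cast<Match>()
                   .Select(m => m.Value.Trim())                     // Trim() is an addition for readability
                   .ToList();

foreach (var m in matches)
    Console.WriteLine(m);
```

Unlike the backreference version, this pattern does not check that the opening and closing numbers are the same; it only filters out the whitespace-only gaps.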
