5

The following returns true

Regex.IsMatch("FooBar\n", "^([A-Z]([a-z][A-Z]?)+)$");

so does

Regex.IsMatch("FooBar\n", "^[A-Z]([a-z][A-Z]?)+$");

The RegEx is in SingleLine mode by default, so $ should not match \n. \n is not an allowed character.

The pattern is meant to match a single ASCII PascalCaseWord (yes, it will also match a trailing capital letter).

It doesn't work with any combination of RegexOptions.Multiline | RegexOptions.Singleline either.

What am I doing wrong?

CodeScrubber
  • 3
    On Windows a new line is \r\n, not \n. – Gusman Jun 15 '17 at 19:22
  • Yes but .NET's RegEx implementation matches it. For some strange reason, look at the docs. – CodeScrubber Jun 15 '17 at 19:23
  • Yes, you're right, it treats \n as a newline, so the regex is checking against "FooBar" only, and that's why it matches. Not sure why it treats \n as a newline, maybe to add compatibility with other OSes... – Gusman Jun 15 '17 at 19:28
  • It shouldn't be. From the docs for the RegexOptions.Multiline field (namespace System.Text.RegularExpressions; assemblies System.Text.RegularExpressions.dll, System.dll, netstandard.dll): "Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string. For more information, see the "Multiline Mode" section in the Regular Expression Options topic." – CodeScrubber Jun 15 '17 at 19:45

2 Answers

4

In .NET regex, the $ anchor (as in PCRE, Python, and Perl, but not JavaScript) matches at the end of the string, or at the position before the final newline ("\n") character in the string.

See this documentation:

$   The match must occur at the end of the string or line, or before \n at the end of the string or line. For more information, see End of String or Line.

No modifier can redefine this in .NET regex (in PCRE, you can use the D modifier, PCRE_DOLLAR_ENDONLY).

You are probably looking for the \z anchor: it matches only at the very end of the string:

\z   The match must occur at the end of the string only. For more information, see End of String Only.

A short test in C#:

Console.WriteLine(Regex.IsMatch("FooBar\n", @"^[A-Z]([a-z][A-Z]?)+$"));  // => True
Console.WriteLine(Regex.IsMatch("FooBar\n", @"^[A-Z]([a-z][A-Z]?)+\z")); // => False
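To put the end anchors side by side, here is a minimal sketch that runs the question's pattern with $, \Z and \z against "FooBar" with and without a trailing newline. The class and method names (EndAnchorDemo, Matches) are made up for illustration; only Regex.IsMatch comes from the framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EndAnchorDemo
{
    // The pattern from the question, minus its trailing anchor.
    private const string Body = @"^[A-Z]([a-z][A-Z]?)+";

    // Hypothetical helper: appends the given end anchor and tests the input.
    public static bool Matches(string input, string endAnchor)
    {
        return Regex.IsMatch(input, Body + endAnchor);
    }

    public static void Main()
    {
        foreach (string anchor in new[] { "$", @"\Z", @"\z" })
        {
            bool plain = Matches("FooBar", anchor);
            bool withNewline = Matches("FooBar\n", anchor);
            Console.WriteLine("{0,-2}  FooBar: {1,-5}  FooBar\\n: {2}", anchor, plain, withNewline);
        }
        // $  and \Z accept both inputs; \z accepts only the input without the newline.
    }
}
```

\Z behaves like $ without RegexOptions.Multiline (end of string, or before a final "\n"), so \z is the only one of the three that rejects the trailing newline.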
Graham
Wiktor Stribiżew
1

From wikipedia:

$ Matches the ending position of the string or the position just before a string-ending newline. In line-based tools, it matches the ending position of any line.

So you are asking whether there is a capital letter at the start of the string, followed by one or more repetitions of (a lowercase letter plus an optional capital letter), followed by the end of the string or the position just before a string-ending newline.

That all seems true.
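One way to see this is to look at what the match actually covers. The sketch below (the names MatchSpanDemo and FindWord are hypothetical) uses Regex.Match instead of Regex.IsMatch and prints where the match ends compared to the input length:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchSpanDemo
{
    // Hypothetical helper: runs the question's pattern and returns the Match.
    public static Match FindWord(string input)
    {
        return Regex.Match(input, @"^[A-Z]([a-z][A-Z]?)+$");
    }

    public static void Main()
    {
        string input = "FooBar\n";
        Match m = FindWord(input);

        Console.WriteLine(m.Success);            // True
        Console.WriteLine(m.Value == "FooBar");  // True: the newline is not part of the match
        Console.WriteLine(m.Index + m.Length);   // 6: the match ends just before the '\n'
        Console.WriteLine(input.Length);         // 7
    }
}
```

The newline is never consumed; $ simply succeeds at position 6 because only a single "\n" follows it.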

And yes, there seems to be some mismatch between different documentation sources about what is regarded as newline and how $ works or should work exactly. It always brings to mind the wisdom:

Sometimes a man has a problem and he figures he will use a regex to solve it.
Now the man has two problems.

oerkelens
  • No, it shouldn't. On Windows a new line must be \r\n, not \n, so the last char on the line is \n – Gusman Jun 15 '17 at 19:27
  • @Gusman `Regex.IsMatch("FooBar\n\n", "^[A-Z]([a-z][A-Z]?)+$", RegexOptions.Singleline)` (two newlines) returns false. With MultiLine, it returns true. I think he's right. IIRC, treating plain `'\n'` as a newline, for compatibility with UNIX, is an ancient convention in MS-land. In C you'd write `\n` to a `FILE *` opened in text mode and it'd actually write `\r\n` to the file. It's `\r\n` in a file, but in a buffer it could be `\n`. – 15ee8f99-57ff-4f92-890c-b56153 Jun 15 '17 at 19:29
  • @EdPlunkett Yes, he is right but shouldn't be right, that's what I meant XD. The problem is the definition of "new line", on windows a "new line" is CR+LF, but the regex is treating LF as "new line" like on *nix – Gusman Jun 15 '17 at 19:32
  • 1
    @Gusman [Au contraire](https://msdn.microsoft.com/en-us/library/wyssk1bs.aspx): *"_read returns the number of bytes read, which might be less than count if there are fewer than count bytes left in the file or **if the file was opened in text mode**, in which case **each carriage return–line feed (CR-LF) pair is replaced with a single linefeed character**."* Same is true of buffered IO functions. – 15ee8f99-57ff-4f92-890c-b56153 Jun 15 '17 at 19:35
  • Look at the Regex documentation on MSDN. It surprised me, too. RegexOptions.Singleline changes the behavior of the '.' character. RegexOptions.Multiline changes the behavior of the '^' and '$' characters, like Ed says above. The problem is, the docs say this defaults to the beginning and end of the string. – CodeScrubber Jun 15 '17 at 19:38
  • @oerkelens according to MSDN $ should mean End of string. That's what baffles me. – CodeScrubber Jun 15 '17 at 19:42
  • @CodeScrubber So MSDN is wrong. Submit a report. I'm seeing the same behavior you are. If I add a second newline, it ceases to match. No body of information the size of MSDN can be without errors. – 15ee8f99-57ff-4f92-890c-b56153 Jun 15 '17 at 19:49
  • 1
    @EdPlunkett: See [MS DOTNET regex documentation about anchors](https://learn.microsoft.com/en-us/dotnet/standard/base-types/anchors-in-regular-expressions). It is not easy to find, but the behavior is documented. – Wiktor Stribiżew Jun 15 '17 at 20:04
  • @Ed Plunkett Yep. \n\n \r\n work as expected, I think the docs are right and the code is wrong. Thanks for the help. – CodeScrubber Jun 15 '17 at 20:09
  • @Wiktor Stribiżew Thanks! That explains it. I still think someone is "documenting away the problem".... – CodeScrubber Jun 15 '17 at 20:14
  • @CodeScrubber: It is not a problem, it is a legacy design. The regex flavor used in .NET has its origins in Perl regex, and `$` was used in Perl to match the end of the string, or the location before a final newline. Why? In Perl, the most common way to read in a text file is line by line. When a line is read, it is read with the newline char at the end. There is a [`chomp` command](https://perldoc.perl.org/functions/chomp.html), but developers are lazy, so `$` was allowed to match before the final newline. – Wiktor Stribiżew Jun 15 '17 at 20:20
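
The two-newline and \r\n cases discussed in the comments above can be checked with a short sketch like this one. NewlineVariantsDemo and Test are hypothetical names; the pattern and the RegexOptions values are the ones from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineVariantsDemo
{
    private const string Pattern = @"^[A-Z]([a-z][A-Z]?)+$";

    // Hypothetical helper: tests the question's pattern with the given options.
    public static bool Test(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, Pattern, options);
    }

    public static void Main()
    {
        // Two trailing newlines: $ may only sit before the *final* "\n".
        Console.WriteLine(Test("FooBar\n\n", RegexOptions.None));       // False
        Console.WriteLine(Test("FooBar\n\n", RegexOptions.Singleline)); // False (Singleline only affects '.')
        // Multiline lets $ match before any "\n".
        Console.WriteLine(Test("FooBar\n\n", RegexOptions.Multiline));  // True
        // "\r" is not skipped by $, so a CR-LF ending does not match.
        Console.WriteLine(Test("FooBar\r\n", RegexOptions.None));       // False
    }
}
```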