Is there a regex pattern for .NET that will match any character that will result in multiple lines, i.e. any vertical whitespace character, like perl regex does with \v? In other words, is there a way to match \r (carriage return), \n (line feed), \v (vertical tab), and \f (form feed) as well as the Unicode characters U+0085 (next line), U+2028 (line separator), and U+2029 (paragraph separator) and any other characters I'm not aware of that might result in more than one line?

I'm writing some validation code in .NET that will fail if a user has provided input text that contains more than one line. In most cases, that means I just have to check for \r and \n. However, I know there is a multitude of other vertical whitespace characters.

I know .NET regex differs from perl regex, most importantly in that \v in .NET matches "vertical tab" whereas it matches "vertical whitespace" in perl regex.

rory.ap
  • The [perl docs](http://perldoc.perl.org/perlrecharclass.html) on re classes claim that the perl regex engine behaves as desired on `\v` (I haven't checked that empirically). – collapsar Feb 26 '15 at 13:36
  • What dialect of regex are you using? What is your programming language (and version)? Does Perl's `\v` work for you? (See https://stackoverflow.com/questions/12290224/is-n-a-vertical-whitespace-i-e-should-v-match-it for a list of characters in that class.) – Christopher Creutzig Feb 26 '15 at 13:36
  • According to the [Unicode character property for whitespace cited on Wikipedia](https://en.wikipedia.org/wiki/Whitespace_character), the list behind the link in @ChristopherCreutzig's comment is exhaustive (\x0a-\x0d, \x85, U+2028, U+2029). – collapsar Feb 26 '15 at 13:43
  • You can use `\R` in Perl that stands for a generic newline. – Casimir et Hippolyte Feb 26 '15 at 13:44

2 Answers

As you say, the Perl character class \v matches [\x0A-\x0D] (line feed, vertical tab, form feed and carriage return, although I would dispute that CR is vertical white space) in addition to the non-ASCII code points [\x{85}\x{2028}\x{2029}] (next line, line separator and paragraph separator).

You can hand-build this character class in .NET like this. Note that the .NET \u escape takes exactly four hexadecimal digits, so the short form \u0A is rejected as an invalid pattern:

[\u000A-\u000D\u0085\u2028\u2029]
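A minimal sketch of how the validation described in the question might use such a character class, assuming C# and `System.Text.RegularExpressions`. The class and method names are hypothetical. The pattern uses four-digit `\u` escapes, which .NET requires, and it includes U+0085 (next line) because the question lists it explicitly.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative, not part of any library.
public static class SingleLineValidator
{
    // U+000A-U+000D (LF, VT, FF, CR), U+0085 (NEL),
    // U+2028 (line separator), U+2029 (paragraph separator).
    private static readonly Regex VerticalWhitespace =
        new Regex(@"[\u000A-\u000D\u0085\u2028\u2029]", RegexOptions.CultureInvariant);

    // Returns true when the input contains none of the characters above.
    // Throws ArgumentNullException for null input, as Regex.IsMatch does.
    public static bool IsSingleLine(string input)
    {
        return !VerticalWhitespace.IsMatch(input);
    }
}
```

With this sketch, `SingleLineValidator.IsSingleLine("one line")` is true, while inputs such as `"a\r\nb"` or `"a\u2028b"` are rejected. The list is not a definitive set of line-breaking characters, so the class may need extending if other characters cause trouble.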
Borodin
  • @roryap: I'm pleased to help. I modified your question by moving the last paragraph to the beginning. I think it still reads well. It seemed that people who hadn't read to the end were getting the wrong end of the stick. – Borodin Feb 26 '15 at 16:06
  • @roryap: You might think it more readable written as `[\r\v\f\n\u0085\u2028\u2029]`. I don't think there's a way of adding characters to a .NET string using their Unicode names, which would have been ideal. – Borodin Feb 26 '15 at 16:08
  • Thanks again. Too bad .NET doesn't have a single char escape like Perl does with `\v` for vertical whitespace, but, as you pointed out, that's only for ASCII and wouldn't include the two additional Unicode characters `\u2028` and `\u2029`. – rory.ap Feb 26 '15 at 16:11
  • @roryap: You misunderstand me: those seven code points are what Perl's `\v` character class matches. A literal Perl equivalent would be `[\r\N{VERTICAL TABULATION}\f\n\x{85}\N{LINE SEPARATOR}\N{PARAGRAPH SEPARATOR}]`. It's not a definitive list of vertical whitespace characters, and you may want to add to it if you find other characters that cause trouble. – Borodin Feb 26 '15 at 16:16
  • Ahh, I misread it then. You said "As you say, the Perl character class \v matches [\x0A-\x0D]" which I interpreted as being exhaustive, but you then said "And if you're also interested in non-ASCII code points **then it includes** [\x{2028}\x{2029}]" – rory.ap Feb 26 '15 at 16:18
To match any unknown characters, simply use a negated set, [^ ] (at least in .NET; my Perl is a little hazy), to match up to a specific character. For example, to match everything from the current position, across a line break, up to a next line that starts with the letter D, I would use this:

([^D]+)

The capture would then include every character that is not a D, which covers every type of whitespace there is, up to the first letter D.
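A short sketch of that negated set in C#, assuming `System.Text.RegularExpressions`. The class and method names are hypothetical. It also shows the catch with this approach: the set excludes only `D`, so non-whitespace characters before the `D` are captured as well.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class NegatedSetDemo
{
    // Returns the first run of characters that are not 'D',
    // or an empty string when the input has no such run.
    public static string CaptureUpToD(string input)
    {
        Match m = Regex.Match(input, @"([^D]+)");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

For the input `"abc \r\n\u2028Def"`, `CaptureUpToD` returns `"abc \r\n\u2028"`: the line breaks are captured, but so are the letters `abc`. It therefore does not by itself tell you whether the input contains more than one line.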

ΩmegaMan