IOWs, the negated form of the \w character class. And should I expect different behavior from the different languages I'm using the regex in?
-
What do you mean by IOWs? And I think the negated form of \w is \W. – Mar 02 '13 at 21:36
2 Answers
Of course `\W` includes `\r` and `\n`. `\W` is the negation of `\w`, and `\w` contains letters, digits and connecting punctuation characters (like the underscore).
There are now 3 possibilities:
1. `\w` is ASCII based ==> `[a-zA-Z0-9_]`
2. `\w` is Unicode based ==> something like `[\p{L}\p{Nd}\p{Pc}]`, which means letters and digits from all languages, plus some more characters similar to the underscore. See Unicode on regular-expressions.info.
3. The flavour allows you to switch the behaviour of `\w` with a modifier.
But since newline characters are never included in `\w`, they are in all cases included in `\W`.
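As a rough illustration of the possibilities above, here is a small sketch in Python, chosen only because its `re` module happens to be a flavour where `\w` is Unicode based by default for `str` patterns and can be switched to ASCII with the `re.ASCII` modifier. The sample text and variable names are made up for the example; other languages may behave differently.

```python
import re

# Hypothetical sample: "naive cafe" written with precomposed accented letters.
text = "na\u00efve caf\u00e9"

# Default for str patterns in Python 3: \w is Unicode based.
unicode_words = re.findall(r"\w+", text)

# With the re.ASCII modifier, \w shrinks to [a-zA-Z0-9_],
# so the accented letters now count as \W and split the words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# In both modes, \r and \n are matched by \W.
newline_is_nonword = all(
    re.fullmatch(r"\W", ch, flags=flags) is not None
    for ch in ("\r", "\n")
    for flags in (0, re.ASCII)
)

print(unicode_words)       # ['naïve', 'café']
print(ascii_words)         # ['na', 've', 'caf']
print(newline_is_nonword)  # True
```

Whichever way `\w` is defined, the line-break characters stay on the `\W` side; only the treatment of non-ASCII letters and digits changes.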
-
Then `split /[\W+\n+\r+]/, $multi_line_string;` should be equal to `split /\W+/, $multi_line_string;`? I ask because I'm getting a different number of results from each of these. – Jim Black Mar 02 '13 at 22:40
-
This probably comes from `[\W+\n+\r+]`: you need to put the quantifier after the character class. Written this way, each `+` is just a literal member of the class, so the class matches only one character at a time. Compare `[\W\n\r]+` and `\W+` – stema Mar 02 '13 at 22:43
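The difference discussed in these comments can be sketched as follows. The example uses Python's `re.split` rather than Perl's `split`, so treat it as an analogy, not a reproduction of the original script; the input string is invented for the illustration.

```python
import re

# Hypothetical multi-line input with a Windows-style line break.
s = "foo, bar\r\nbaz"

# Quantifiers inside the brackets are literal '+' characters, so the class
# matches exactly one separator character per split point.
per_char = re.split(r"[\W+\n+\r+]", s)

# Quantifier after the class (or after \W): a whole run of separators
# is consumed as one split point.
runs = re.split(r"\W+", s)
runs_explicit = re.split(r"[\W\n\r]+", s)

print(per_char)       # ['foo', '', 'bar', '', 'baz']
print(runs)           # ['foo', 'bar', 'baz']
print(runs_explicit)  # ['foo', 'bar', 'baz']
```

The extra empty strings in the first result come from adjacent separators such as `, ` and `\r\n`, which would explain a different number of results.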
`\w` is a short-hand for `[a-zA-Z0-9_]` (in ASCII-based flavours), so it will match only a-z (lower and upper case), digits and the underscore. The negated form of `\w` is `\W`, which will match everything that `\w` does not.
Read more here.
Basically there are 2 types of regex, POSIX and Perl. Theoretically, POSIX regex should behave the same regardless of programming language, but there are some known exceptions. See this thread for differences between Java and .NET (theoretically the same, practically not): Are Java and C# regular expressions compatible?