Why does Java regex match an underscore?

Question

I was trying to match the URL pattern string.string. for any number of repetitions of string. using ^([^\\W_]+.)([^\\W_]+.)$ as a first attempt, and it works for matching two consecutive occurrences. But when I generalize it to ^([^\\W_]+.)+$ it stops working and also matches the wrong input "string.str_ing.". What is incorrect with the second version?

`\w` includes the underscore too. Also, for a couple of years now URLs may contain Unicode letters. – Joop Eggen, Jun 28 '20 at 19:10

score 0 · Answer 1 · answered Jun 28 '20 at 18:35

With ^([^\\W_]+.)([^\\W_]+.)$ you match exactly two words made of a restricted set of characters, each followed by one more character. Although you have not escaped the ., it still works here: the first group matches string, the unescaped . matches any single character (that is what an unescaped . means, and here that character happens to be the literal dot), and the second group matches string. in the same way.

In the second pattern the unescaped dot (.) is part of a capturing group that repeats one or more times (because of the +), so any character is accepted as a separator. In other words, string.str_ing. is read as:

string as the 1st word
str as the 2nd word
ing as the 3rd word

... because the unescaped dot (.) accepts any character as a separator (both the literal . and _).
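
The breakdown above can be checked with a minimal sketch. The class name UnescapedDotDemo and the helper lastGroup are made up for illustration; only the pattern comes from the question. Since a repeated capturing group keeps the text of its last iteration, group 1 should end up holding the third "word" plus its separator.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnescapedDotDemo {
    // Pattern from the question: the dot is unescaped, so it matches any character.
    static final Pattern UNESCAPED = Pattern.compile("^([^\\W_]+.)+$");

    // Hypothetical helper: returns what the last iteration of the group captured,
    // or null if the whole input does not match.
    static String lastGroup(String input) {
        Matcher m = UNESCAPED.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // "string" + ".", "str" + "_", "ing" + "." : three iterations of the group.
        System.out.println(lastGroup("string.str_ing.")); // prints "ing."
    }
}
```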

Escape the dot to make the regex work as intended (in a Java string literal the escape is written \\., just like the \\W in the question):

^([^\\W_]+\\.)+$
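
A small sketch of how the escaped pattern behaves, assuming it is written as a Java string literal (so the escaped dot becomes \\. in source). The class name EscapedDotDemo and the method isValid are illustrative names, not part of the original answer.

```java
import java.util.regex.Pattern;

class EscapedDotDemo {
    // Escaped dot: each word must now be followed by a literal '.'.
    static final Pattern ESCAPED = Pattern.compile("^([^\\W_]+\\.)+$");

    // Hypothetical helper: true if the whole input is one or more "word." units.
    static boolean isValid(String input) {
        return ESCAPED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("string.string."));  // true
        System.out.println(isValid("a.b.c."));          // true
        System.out.println(isValid("string.str_ing.")); // false: '_' is not a literal dot
        System.out.println(isValid("string"));          // false: trailing dot is required
    }
}
```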

score 0 · Accepted Answer · answered Jun 28 '20 at 18:36

You need to escape the . character; otherwise it will match any character, including _.

^([^\\W_]+\\.?)+$

This can be your generalised regex.
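
As a rough illustration of what the optional dot changes, here is a minimal sketch, again assuming the pattern is written as a Java string literal. The class and method names are invented for the example. Because the dot is optional, this variant also accepts input without a trailing dot.

```java
import java.util.regex.Pattern;

class OptionalDotDemo {
    // Same as the escaped pattern, but the literal dot after each word is optional.
    static final Pattern OPTIONAL_DOT = Pattern.compile("^([^\\W_]+\\.?)+$");

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean isValid(String input) {
        return OPTIONAL_DOT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("string.string."));  // true
        System.out.println(isValid("string.string"));   // true: trailing dot not required
        System.out.println(isValid("string.str_ing.")); // false: '_' is never allowed
        System.out.println(isValid("string..string"));  // false: two dots in a row
    }
}
```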

Karthik Radhakrishnan

score 0 · Answer 3 · answered Jun 28 '20 at 18:37

[^\W] seems a weird choice - it's matching 'not not-a-word-character'. I haven't thought it through, but that sounds like it's equivalent to \w, i.e., matching a word character.

Either way, with [^\W] and \w, you're asking to match underscores - which is why it matches the string with the underscore. "Word characters" are uppercase alphabetics, lowercase alphabetics, digits, and underscore.

You probably want [a-z]+ or maybe [A-Za-z0-9]+
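
The character classes discussed here can be compared directly. This is only a sketch using Java's default (ASCII) meaning of \w; the class name WordClassDemo and the helper matches are made up for the example.

```java
import java.util.regex.Pattern;

class WordClassDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("\\w", "_"));                 // true: '_' is a word character
        System.out.println(matches("[^\\W]", "_"));              // true: same set as \w
        System.out.println(matches("[^\\W_]", "_"));             // false: '_' is excluded explicitly
        System.out.println(matches("[A-Za-z0-9]+", "string1"));  // true
        System.out.println(matches("[A-Za-z0-9]+", "str_ing"));  // false
    }
}
```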

user13784117

Nope, it doesn't. The content of the `[]` matches anything except a literal `/` (`//`). The `\W` (it should be `\w` anyway) doesn't work as a shortcut for `[a-zA-Z0-9_]`, since the initial two backslashes (`\\`) have their own meaning and the `W`/`w` character remains unescaped. Three backslashes would be needed for it to take effect (and a lower-case `w`). – Nikolas Charalambidis Jun 28 '20 at 18:40
There are no slashes, only backslashes, in the given regex. I assumed the \\ is just Java source syntax for a single \. Otherwise the expression is just bizarre - [^\\W_]+ matches a string of characters except for backslash, W, and underscore. Which might well have given the result seen, but it doesn't seem like a useful parsing, and I doubt it was intended. – user13784117 Jun 28 '20 at 21:02
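
The two readings of \\ discussed in these comments can be told apart with a short sketch. It assumes the pattern in the question is Java source text; the class name EscapingDemo and its helper are illustrative only.

```java
import java.util.regex.Pattern;

class EscapingDemo {
    // Java source "[^\\W_]+" is the regex [^\W_]+ (one backslash reaches the regex engine).
    static final String AS_JAVA_SOURCE = "[^\\W_]+";
    // Java source "[^\\\\W_]+" is the regex [^\\W_]+ : anything but backslash, 'W' and '_'.
    static final String LITERAL_BACKSLASH = "[^\\\\W_]+";

    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(AS_JAVA_SOURCE.length());           // 7: [ ^ \ W _ ] +
        System.out.println(matches(AS_JAVA_SOURCE, "W"));      // true: 'W' is a word character
        System.out.println(matches(AS_JAVA_SOURCE, "a-b"));    // false: '-' is not a word character
        System.out.println(matches(LITERAL_BACKSLASH, "W"));   // false: 'W' is excluded literally
        System.out.println(matches(LITERAL_BACKSLASH, "a-b")); // true: '-' is not excluded
    }
}
```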
