51

Is there a concise way to express:

\w but without _

That is, "all characters included in \w, except _"

I'm asking this because I'm looking for the most concise way to express domain name validation. A domain name may include lowercase and uppercase letters, numbers, period signs and dashes, but no underscores. \w includes all of the above, plus an underscore. So, is there any way to "remove" an underscore from \w via regex syntax?

Edited: I'm asking about regex as used in PHP.

Thanks in advance!

Joseph Silber
Dimitri Vorontzov
  • 6
    Depends on the regex flavour. Which language are you using? The easiest way though would be to just use `[A-Za-z0-9]`. `\w` does (normally) **not** include dashes or periods. – Felix Kling Feb 13 '13 at 16:37
  • 1
    Depending on the flavor, `\w` may support Unicode characters. Unless you are totally sure about what `\w` represents, it is best to use a character class `[]` and list all the characters explicitly. – nhahtdh Feb 13 '13 at 16:38

8 Answers

61

Use the following character class (in Perl):

[^\W_]

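\W is the same as [^\w], so [^\W_] matches any character that is neither a non-word character nor an underscore, i.e. any word character other than _.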
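One way to convince yourself is to compare the class against an explicit list over the ASCII range. This is only a sketch in Python's `re` (not Perl or PHP), and it passes `re.ASCII` on the assumption that `\w` should mean `[A-Za-z0-9_]`; the helper name is made up for illustration.

```python
import re
import string

# Assumption: ASCII semantics for \w, i.e. [A-Za-z0-9_].
WORD_NO_UNDERSCORE = re.compile(r"[^\W_]", re.ASCII)

def matched_ascii_chars(pattern):
    """Return every ASCII character that the single-character pattern matches."""
    return {chr(i) for i in range(128) if pattern.fullmatch(chr(i))}

matched = matched_ascii_chars(WORD_NO_UNDERSCORE)
print(matched == set(string.ascii_letters + string.digits))  # True
print("_" in matched)  # False
```

Under that assumption the class matches exactly the ASCII letters and digits, so periods and dashes still have to be added separately.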

protist
  • explain to me how it is not....and note that the `?:` part is just saying to not actually capture the group found by the atom – protist Feb 13 '13 at 16:47
  • 1
    @protist: The atom is WRONG. `\w` will match `_`, and `|` is alternation and acts like OR, not AND – nhahtdh Feb 13 '13 at 16:48
  • Sorry, I should have mentioned it before. I'm using PHP. Would that work in PHP? – Dimitri Vorontzov Feb 13 '13 at 17:05
  • I am no expert on PHP, but a little research affirms that PHP does indeed have a `\W` as used in my Perl. This will likely work for you as well. – protist Feb 13 '13 at 17:07
  • So, do I understand this correctly, that [^\W_] is the same as [A-Za-z0-9.-]? – Dimitri Vorontzov Feb 13 '13 at 17:12
  • 1
    I am unsure as to whether `.` and `-` are included, as what is considered a `word` character differs slightly by locale. Some sources say `\w` is equivalent to `[A-Za-z0-9_]` (but are sure to say this is not always true). `[^\W_]` is certainly `\w but without _`, though. – protist Feb 13 '13 at 17:25
  • very creative, Thanx – sami Mar 22 '22 at 03:00
  • 1
    To put it into words, think: not ((not word) or underscore), where word is `[a-zA-Z0-9_]` – Adithya Jul 06 '22 at 22:00
18

You could use a negative lookahead: (?!_)\w

However, I think writing [a-zA-Z0-9.-] is more readable.
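For what it's worth, here is a rough comparison of the lookahead form with a negated class, sketched in Python's `re` with `re.ASCII` (an assumption, so that `\w` is exactly `[A-Za-z0-9_]`). The sample string is made up.

```python
import re

# Assumption: re.ASCII so that \w is exactly [A-Za-z0-9_].
lookahead = re.compile(r"(?:(?!_)\w)+", re.ASCII)
negated = re.compile(r"[^\W_]+", re.ASCII)

sample = "foo_bar-baz.Qux_42"
print(lookahead.findall(sample))  # ['foo', 'bar', 'baz', 'Qux', '42']
print(negated.findall(sample))    # ['foo', 'bar', 'baz', 'Qux', '42']

# Pitfall: the lookahead only guards the character right after it.
print(re.findall(r"(?!_)\w+", "foo_bar", re.ASCII))  # ['foo_bar']
```

Note that the lookahead has to be repeated for every character, which is why it is wrapped in a non-capturing group before the `+`.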

Felix Kling
Bergi
  • 1
    That would be `(?!_)\w`, no? – Zero Piraeus Feb 13 '13 at 16:42
  • Look-around is slower than normal matching. May not matter here, though – nhahtdh Feb 13 '13 at 16:47
  • Thanks a lot, @Bergi - I have a question: wouldn't it be proper to write [a-zA-Z0-9\.\-] - escaping period and dash – or is it wrong/unnecessary to escape them in this case? (I'm new to regex, and this may be a silly question...) – Dimitri Vorontzov Feb 13 '13 at 17:14
  • 1
    Not necessary: http://www.regular-expressions.info/charclass.html. Only characters that have a special meaning in a character class (`]\^-`) need to be escaped, and not when unambiguous. – Bergi Feb 13 '13 at 17:20
  • Thank you very much, @Bergi! So, looking through the entire body of answers to my question, these solutions would all work: (?!_)\w --- [^\W_] --- or [A-Za-z0-9.-] --- am I right? – Dimitri Vorontzov Feb 13 '13 at 17:23
  • 1
    @Dimitri: Yes, provided that `\w` means `[a-zA-Z0-9_]` in your regex flavour. Note that only the last of the three also matches `.` and `-`. – Bergi Feb 13 '13 at 17:27
5

To be on the safe side, we would usually use an explicit character class:

[a-zA-Z0-9.-]

The regex "fragment" above matches the letters of the English alphabet and the digits, plus period . and dash -. It should work even with the most basic regex support.

Shorter may be better, but only if you know exactly what it represents.

I don't know what language you are using. In a lot of engines, \w is equivalent to [a-zA-Z0-9_] (some require an "ASCII mode" for this). However, some engines have Unicode support for regex, and may extend \w to match Unicode characters.
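Python's `re` module happens to be one such engine, so it makes a convenient illustration (this is a sketch in Python, not PHP; the sample text is made up). For `str` patterns `\w` is Unicode-aware by default, and `re.ASCII` restricts it to `[a-zA-Z0-9_]`.

```python
import re

# "caf\u00e9_\u00f19" is "café_ñ9" written with escapes to avoid encoding issues.
text = "caf\u00e9_\u00f19"

# Default for str patterns in Python 3: \w also covers non-ASCII letters.
unicode_hits = re.findall(r"[^\W_]", text)
# With re.ASCII, \w is limited to [a-zA-Z0-9_].
ascii_hits = re.findall(r"[^\W_]", text, re.ASCII)

print(unicode_hits)  # ['c', 'a', 'f', 'é', 'ñ', '9']
print(ascii_hits)    # ['c', 'a', 'f', '9']
```

The explicit class `[a-zA-Z0-9.-]` gives the same result in both modes, which is the main argument for spelling it out.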

nhahtdh
3

If my understanding is right, \w means [A-Za-z0-9_]; periods and dashes are not included.

info: http://en.wikipedia.org/wiki/Regular_expression#POSIX_character_classes

so I guess what you want is [a-zA-Z0-9.-]
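As a minimal sketch of how that class could be used, here it is in Python. The helper name is hypothetical, and this only checks which characters appear; it does not enforce the other rules for domain names (label length, no leading or trailing dash, and so on).

```python
import re

# Hypothetical helper; the character class is the one suggested above.
DOMAIN_CHARS = re.compile(r"[a-zA-Z0-9.-]+")

def has_only_domain_chars(name):
    """Return True if name is non-empty and uses only letters, digits, '.' and '-'."""
    return DOMAIN_CHARS.fullmatch(name) is not None

print(has_only_domain_chars("sub-domain.example.org"))  # True
print(has_only_domain_chars("my_site.example.org"))     # False
```

`fullmatch` is used instead of `search` so that a single disallowed character anywhere in the string makes the check fail.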

Kent
2

Some regex flavours have a negative lookbehind syntax you might use:

\w(?<!_)
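
Python's `re` is one flavour that supports this fixed-width lookbehind, so a quick check might look like the following. It is only a sketch, with `re.ASCII` assumed so that `\w` is exactly `[A-Za-z0-9_]`.

```python
import re

# Assumption: re.ASCII so that \w is exactly [A-Za-z0-9_].
lookbehind = re.compile(r"\w(?<!_)", re.ASCII)

print(bool(lookbehind.fullmatch("a")))  # True
print(bool(lookbehind.fullmatch("7")))  # True
print(bool(lookbehind.fullmatch("_")))  # False
print(bool(lookbehind.fullmatch("-")))  # False
```

The engine first consumes a word character and then looks back to reject it if it was an underscore.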
Zero Piraeus
  • 2
    Negative lookaheads are more widely supported than negative lookbehinds. – Joseph Silber Feb 13 '13 at 16:42
  • 1
    @JosephSilber True. Conceptually, I find "give me a word character ... but not an underscore" slightly easier than "the next thing I want shouldn't be an underscore ... otherwise, give me a word character" to follow, if negative lookbehinds *are* available, though. – Zero Piraeus Feb 13 '13 at 16:49
1

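For anybody looking to match [^a-zA-Z0-9]+: it can be shortened to [\W_]+ (in Python, with the re.ASCII flag so that \W means [^a-zA-Z0-9_]). Writing it as [\W^_]+ also works, but a ^ that is not the first character of a class is a literal caret, which \W already covers.

Note that the class is a union: \W matches [^a-zA-Z0-9_] and the _ is then added on top of it; nothing is "unmatched". Micro performance might still be slightly worse than with the explicit range.

import re

def camelCaseNotation(value):
    """Remove each run of symbol characters and uppercase the alphanumeric character that follows it."""
    return re.sub(r"[\W_]+(\w?)", lambda m: m.group(1).upper(), value)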
0

I would start with [^_], and then think about which other characters I need to exclude. If you need to filter keyboard input, it's quite simple to enumerate all the unwanted characters.

Zoltán Tamási
  • 2
    This is a very poor approach. Domain names have a defined set of allowed characters, so whitelisting can be done. When you blacklist, you also need to care about which Unicode characters to deny. – nhahtdh Feb 13 '13 at 16:50
  • @nhahtdh, I've taken into account that domain names CAN have Unicode characters (for example accented vowels). So I think it's quite hard to form a precise and fully correct whitelist. – Zoltán Tamási Feb 13 '13 at 17:25
  • There are specs for that - it is troublesome, but defined. People tend to forget or overlook things when blacklisting. – nhahtdh Feb 13 '13 at 17:28
  • I agree, that's why I mentioned if the case is a keyboard input, because that can simplify things IMHO. – Zoltán Tamási Feb 13 '13 at 17:34
0

You can write something like this:

/([^\w]|_)/u

If you use preg_filter with this pattern and an empty replacement, every character outside \w, as well as the underscore, is removed, leaving only the characters in \w other than _.

MrD