29

Is there a simple way to match all characters in a class except a certain set of them? For example, in a language where I can use \w to match the set of all Unicode word characters, is there a way to exclude just one character, such as an underscore "_", from that match?

The only idea that came to mind was to use a negative lookahead/lookbehind around each character, but that seems more complex than necessary when I effectively just want to match a character against a positive match AND a negative match. For example, if & were an AND operator, I could do this...

^(\w&[^_])+$
melpomene
Dan Roberts
  • 4
    Which flavor of regex are you using? (e.g. Perl, Java, etc.) – Thomas Langston Jun 26 '13 at 18:30
  • What regex flavor/language? http://stackoverflow.com/q/3201689/139010 – Matt Ball Jun 26 '13 at 18:30
  • 1
    In .NET you could use `[\w-[_]]` to exclude the underscore. – HamZa Jun 26 '13 at 18:33
  • The regex engine I use most frequently is java based though an old implementation (whatever CF8 uses under the hood). However I also have this need in javascript and python. – Dan Roberts Jun 26 '13 at 18:44
  • You mean ColdFusion? Its regex flavor is based on Perl, not Java. And its `\w` only recognizes the ASCII word characters (`[A-Za-z0-9_]`), not the full Unicode set. Same goes for Python's built-in `re` flavor. – Alan Moore Jun 26 '13 at 20:26
  • Perl solutions are found [here](https://stackoverflow.com/q/69633772/589924). – ikegami Oct 19 '21 at 17:45

5 Answers

28

It really depends on your regex flavor.

.NET

... provides only one simple character class set operation: subtraction. This is enough for your example, so you can simply use

[\w-[_]]

If a - is followed by a nested character class, it's subtracted. Simple as that...

Java

... provides a much richer set of character class set operations. In particular you can get the intersection of two sets like [[abc]&&[cde]] (which would give c in this case). Intersection and negation together give you subtraction:

[\w&&[^_]]

Perl

... supports set operations on extended character classes as an experimental feature (available since Perl 5.18). In particular, you can directly subtract arbitrary character classes:

(?[ \w - [_] ])

All other flavors

... (that support lookaheads) allow you to mimic the subtraction by using a negative lookahead:

(?!_)\w

This first checks that the next character is not a _ and then matches any \w (which can't be _ due to the negative lookahead).

Note that each of these approaches is completely general: you can subtract one arbitrarily complex character class from another.
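As a rough illustration of that generality, here is a minimal Python sketch using the standard-library `re` module, which has no set-operation syntax of its own. The helper name `subtract_class` is made up for this example and is not part of any library; it simply wraps the lookahead trick shown above.

```python
import re

def subtract_class(base: str, excluded: str) -> str:
    # Hypothetical helper: builds a pattern that matches one character
    # accepted by `base` but rejected by `excluded`. Both arguments are
    # assumed to be patterns that match exactly one character.
    return f"(?:(?!{excluded}){base})"

# Word characters without the underscore
word_no_underscore = re.compile(subtract_class(r"\w", "_") + "+")

# Lowercase ASCII letters without the vowels
consonant = re.compile(subtract_class("[a-z]", "[aeiou]") + "+")

print(word_no_underscore.findall("foo_bar baz_1"))  # ['foo', 'bar', 'baz', '1']
print(consonant.findall("subtraction"))             # ['s', 'btr', 'ct', 'n']
```

The lookahead is evaluated before every single character, so this is likely slower than a native set operation, but it should work in any flavor that supports lookaheads.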

melpomene
Martin Ender
13

You can negate the \w class (giving \W) and then exclude both it and the underscore with a negated character class:

^([^\W_]+)$
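A quick way to see the double negation at work is to try it in Python's built-in `re` module. This is only a sketch: the sample strings are made up, and the Unicode behaviour noted in the comments assumes Python 3 `str` patterns.

```python
import re

# "Not a non-word character and not an underscore" is the same set as
# "a word character other than the underscore".
pattern = re.compile(r"^([^\W_]+)$")

print(bool(pattern.match("abc123")))   # True
print(bool(pattern.match("abc_123")))  # False: contains an underscore
print(bool(pattern.match("héllo")))    # True: \w is Unicode-aware for str patterns in Python 3
```

Because the whole expression is a single character class, no lookaround is needed, which keeps it usable in engines without lookahead support.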
Casimir et Hippolyte
  • Creative, but I don't think the OP expected this kind of answer, he wants to exclude a character in a general case. Nice idea though – HamZa Jun 26 '13 at 18:40
  • @CasimiretHippolyte I should have thought of this. HamZa is right that I was looking for a more general case, but woah... \p... thank you for pointing that out as I have never used it. – Dan Roberts Jun 26 '13 at 18:50
  • @CasimiretHippolyte not *all* cases. This cannot be used to exclude a character from a range ;). – Martin Ender Jun 26 '13 at 18:52
  • Not all RE engines support that. – Donal Fellows Jun 26 '13 at 19:43
  • @DonalFellows what do you mean by "that"? Negated character classes? – Martin Ender Jun 26 '13 at 19:52
  • This works great, but only with a *single* class *except some characters* (e.g. `\w` without `_`), *not* with *multiple* classes *except some characters* (e.g. `\w` and `\p{P}` without `_`). – caw Jan 16 '22 at 10:37
  • @caw: your example is outside the scope of the question, and except for regex flavors that allow operations inside character classes (intersections, subtractions), I doubt there's a miraculous solution (short of building it by hand with ranges). However, for your particular example, you can do that with PCRE in Unicode mode: `[[:alnum:]\pP]` or `[\p{Xan}\pP]`. In other words, you have to find the best solution for each case with the predefined classes available. – Casimir et Hippolyte Jan 16 '22 at 13:26
  • @CasimiretHippolyte This was not criticism of your answer. On the contrary, I upvoted it and agree that it’s the perfect answer for this specific question. My comment was just intended as advice for people with adjacent problems. – caw Jan 17 '22 at 08:53
  • @caw: sorry if my answer looks rude, I am not ~totally~fluent~ in English. Your comments are welcome, and criticism too. Thanks for "the perfect answer", other answers are useful too. – Casimir et Hippolyte Jan 17 '22 at 20:28
11

A negative lookahead is the correct way to go insofar as I understand your question:

^((?!_)\w)+$
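For instance, the anchored pattern can be used to validate a whole string. The following is a minimal sketch in Python's built-in `re` module; `is_clean_word` is a hypothetical helper name chosen for this example.

```python
import re

# Pattern taken from the answer above: every character must be a word
# character, and the lookahead rejects the underscore at each position.
PATTERN = re.compile(r"^((?!_)\w)+$")

def is_clean_word(text: str) -> bool:
    # Hypothetical helper: True if text consists only of non-underscore word characters.
    return PATTERN.match(text) is not None

print(is_clean_word("abc123"))   # True
print(is_clean_word("abc_123"))  # False
print(is_clean_word(""))         # False: + requires at least one character
```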
Denis de Bernardy
7

This can be done in Python with the third-party regex module. Something like:

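import regex as re  # third-party regex module, not the standard-library re
# Match runs of non-word characters or underscores, minus the space (set subtraction)
pattern = re.compile(r'[\W_--[ ]]+')
cleanString = pattern.sub('', rawString)  # rawString is the input text to clean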

You'd typically install the regex module with pip:

pip install regex

EDIT:

The regex module has two behaviours, version 0 and version 1. Set subtraction (as above) is a version 1 behaviour. The PyPI docs claim version 1 is the default behaviour, but you may find this is not the case. You can check with

import regex
if regex.DEFAULT_VERSION == regex.VERSION1:
  print("version 1")

To set it to version 1:

regex.DEFAULT_VERSION = regex.VERSION1

or to use version one in a single expression:

pattern = re.compile(r'(?V1)[\W_--[ ]]+')
Wiktor Stribiżew
drevicko
5

Try using intersection with a negated class, which amounts to subtraction:

[\w&&[^_]]+

Note: This will work in Java, but might not work in other regex engines.

Rohit Jain