
Yet another question about a regex.

I'm trying to match all special characters, except '*'.

So if I match my regex against:

John%%%* dadidou

I should get:

John* dadidou

Here: How to match with regex all special chars except "-" in PHP?

The accepted answer advises using (if I want to exclude '-'):

[^\w-]

But doesn't that mean "NOT a special character, NOT -", which is a bit redundant?

JPFrancoia

2 Answers


What you really want is this regex for matching:

[^\w\s*]+

Replace each match with an empty string.

The regex matches one or more of any character that is:

  1. Not a word character [AND]
  2. Not a whitespace [AND]
  3. Not a literal *
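
The thread does not name a language, so as a minimal sketch (assuming Python and its standard `re` module), the replacement could look like this:

```python
import re

# One or more characters that are not word characters,
# not whitespace, and not a literal '*'.
pattern = re.compile(r"[^\w\s*]+")

# Replace every match with an empty string.
cleaned = pattern.sub("", "John%%%* dadidou")
print(cleaned)  # John* dadidou
```

Other engines offer an equivalent replace call; only the character class itself comes from the answer.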


anubhava
  • `\w` will match `_`(_special character_) too. `[^A-Za-z0-9\s*]+` – Tushar Feb 13 '16 at 17:16
  • Yes it matches `_` as well but not sure if OP wants to remove `_` also or keep it – anubhava Feb 13 '16 at 17:17
  • Just fyi if OP wants to remove `_` as well then use: [`_|[^\w\s*]+`](https://regex101.com/r/fX7nC6/2) regex – anubhava Feb 13 '16 at 17:37
  • Yes, I also want to remove _. Why isn't it matched by the special characters condition btw? – JPFrancoia Feb 13 '16 at 18:05
  • `\w` does not match special characters, it matches *word* characters. Somehow you keep confusing these two. Because of the negation `^` it matches "not-word characters". The definition of what set "word characters" match can be found on sites describing regular expressions (with the usual caveat that it depends on the implementation). – Jongware Feb 13 '16 at 19:52
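
The two underscore-removing patterns from the comments can be compared with a small sketch. Python is an assumption here, and the sample strings are made up for illustration. In Python 3, `\w` is Unicode-aware for `str` patterns, so the two patterns agree on ASCII input but differ on accented letters:

```python
import re

ascii_only = re.compile(r"[^A-Za-z0-9\s*]+")   # pattern from Tushar's comment
with_alternation = re.compile(r"_|[^\w\s*]+")  # pattern from anubhava's comment

# Hypothetical sample input containing underscores.
sample = "John_%%%* da_didou"
ascii_result = ascii_only.sub("", sample)
alternation_result = with_alternation.sub("", sample)
print(ascii_result)        # John* dadidou
print(alternation_result)  # John* dadidou

# Non-ASCII letters: the explicit A-Za-z0-9 class strips them,
# while the Unicode-aware \w keeps them.
accented = "José_*"
print(ascii_only.sub("", accented))        # Jos*
print(with_alternation.sub("", accented))  # José*
```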

When you define a negated character class, you are really inverting the whole class.

What does that mean?

A positive character class implicitly ORs its contents.

When you negate a class, each item is negated and the results are implicitly ANDed.

So [\w-] means word OR dash;
the inverse, [^\w-], means not word AND not dash.

A negated word class on its own, [^\w], would match a dash -.
So, to avoid matching the dash, you have to add "not dash" to the class as well.
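
A short illustration of that difference, assuming Python's `re` module (the sample string is made up):

```python
import re

text = "a-b%c"

# [^\w] alone: anything that is not a word character, so the dash matches.
not_word = re.findall(r"[^\w]", text)
print(not_word)  # ['-', '%']

# [^\w-]: not a word character AND not a dash, so the dash is left alone.
not_word_not_dash = re.findall(r"[^\w-]", text)
print(not_word_not_dash)  # ['%']
```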

A C analogy would be

existing (varA || varB)
inverted (!varA && !varB)

where inverting negates each component and turns the OR into an AND
(De Morgan's law).

A negated class does the same thing: each component character (or
expression) is negated, and the implicit OR between them becomes an
implicit AND.
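
The equivalence can be checked mechanically. This is only a sketch, assuming Python's `re` module and limiting the check to the 128 ASCII characters; `in_class` is a helper name invented for the example:

```python
import re

def in_class(pattern: str, ch: str) -> bool:
    """Return True if the single character ch matches the given class."""
    return re.fullmatch(pattern, ch) is not None

results = []
for ch in map(chr, range(128)):
    is_word = in_class(r"\w", ch)
    is_dash = ch == "-"
    # Positive class: word OR dash.
    positive_ok = in_class(r"[\w-]", ch) == (is_word or is_dash)
    # Negated class: NOT word AND NOT dash.
    negated_ok = in_class(r"[^\w-]", ch) == ((not is_word) and (not is_dash))
    results.append(positive_ok and negated_ok)

all_agree = all(results)
print(all_agree)  # True
```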


What will really bake your noodle later on is when you see something like
[^\S\r\n]

This translates to NOT-NOT-whitespace AND NOT-CR AND NOT-LF,
which reduces to matching all whitespace except CR and LF.
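
A small sketch of that class in action, again assuming Python's `re` module and made-up sample strings:

```python
import re

# Whitespace (NOT non-whitespace) that is neither CR nor LF.
ws_except_newlines = re.compile(r"[^\S\r\n]")

found = ws_except_newlines.findall("a b\tc\r\nd")
print(found)  # [' ', '\t']

# Typical use: collapse runs of spaces and tabs while keeping line breaks.
collapsed = re.sub(r"[^\S\r\n]+", " ", "a \t b\r\nc")
print(repr(collapsed))  # 'a b\r\nc'
```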