
Special sequences (character classes) in Python regular expressions are escapes like \w or \d that match a set of characters.

In my case, I need to be able to match all alpha-numerical characters except numbers.

That is, \w minus \d.

I need to use the special sequence \w because I'm dealing with non-ASCII characters and need to match symbols like "Æ" and "Ø".

One would think I could use this expression: [\w^\d] but it doesn't seem to match anything and I'm not sure why.

So in short, how can I mix (add/subtract) special sequences in Python Regular Expressions?


EDIT: I accidentally used [\W^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas which are not alpha-numerical characters as far as I'm concerned.

Hubro
  • Your expression matches letters, digits, and ^, I think. The ^ for negating a class should be placed at the beginning of the class definition. – njzk2 Sep 10 '12 at 12:02

4 Answers


You can use r"[^\W\d]", i.e. invert the union of non-alphanumerics and digits.
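As a minimal sketch of this pattern in use (assuming Python 3, where str patterns are Unicode-aware by default; the sample text is made up for illustration):

```python
import re

# Made-up sample containing non-ASCII letters, digits and punctuation.
text = "Æble 123 Øl_4 (x,y)"

# [^\W\d] reads as "not (non-word or digit)", i.e. word characters minus digits.
letters = re.findall(r"[^\W\d]+", text)
print(letters)  # ['Æble', 'Øl_', 'x', 'y']

# \w also includes the underscore; add it to the negated set to drop it too.
letters_only = re.findall(r"[^\W\d_]+", text)
print(letters_only)  # ['Æble', 'Øl', 'x', 'y']
```

Note that the underscore counts as a word character, so whether to use the second form depends on what "alphanumeric" should mean in your case.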

Janne Karila

You cannot subtract character classes, no.

Your best bet is to use the third-party regex project, which offers additional functionality while remaining backwards compatible with the re module in the standard library. It supports character classes based on Unicode properties:

\p{IsAlphabetic}

This will match any character that the Unicode specification states is an alphabetic character.

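Even better, regex does support character class subtraction; it views such classes as sets and allows you to create a difference with the -- operator (the project's documentation lists set operators under its version 1 behaviour, enabled with the (?V1) inline flag or regex.V1):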

[\w--\d]

matches everything in \w except anything that also matches \d.

Martijn Pieters
  • Is the `regex` package still set to replace the built-in `re` module? The built-in module is powerful enough for most basic tasks while still being relatively simple to use. The `regex` module is indeed incomparably powerful for all kinds of tasks but the power comes with a price. I hope that such specialized tools will stay independent. For everyone's benefit. – Jeyekomon May 30 '22 at 12:08
  • @Jeyekomon: No, in the interim the [project has dropped the intention to replace the `re` module](https://github.com/mrabarnett/mrab-regex/commit/47ec54e6ff40afabb3d647c7cd83d35c1d199086). I've updated the wording in my answer to match. – Martijn Pieters May 30 '22 at 12:13

You can exclude classes using a negative lookahead assertion, such as r'(?!\d)[\w]' to match a word character, excluding digits. For example:

>>> import re
>>> re.search(r'(?!\d)[\w]', '12bac')
<re.Match object; span=(2, 3), match='b'>
>>> _.group(0)
'b'

To exclude more than one group, you can use the usual [...] syntax in the lookahead assertion, for example r'(?![0-5])[\w]' would match any alphanumeric character except for digits 0-5.

As with [...], the above construct matches a single character. To match multiple characters, add a repetition operator:

>>> re.search(r'((?!\d)[\w])+', '12bac15')
<re.Match object; span=(2, 5), match='bac'>
>>> _.group(0)
'bac'
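
The partial exclusion described above (word characters except the digits 0-5) can be checked the same way. This is only a sketch assuming Python 3, with made-up sample strings:

```python
import re

# Made-up sample: letters mixed with digits inside and outside 0-5.
text = "a1b6c0"

# The lookahead rejects 0-5 before \w consumes the character,
# so 6 is kept while 1 and 0 are skipped.
kept = re.findall(r"(?![0-5])\w", text)
print(kept)  # ['a', 'b', '6', 'c']

# Same idea for runs of characters, using a non-capturing group so that
# findall returns whole matches rather than the last captured character.
runs = re.findall(r"(?:(?!\d)\w)+", "12bac15 Æ9Ø")
print(runs)  # ['bac', 'Æ', 'Ø']
```

The non-capturing group matters with findall: with a capturing group it would return only the last character captured in each run.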
user4815162342

I don't think you can directly combine (boolean and) character sets in a single regex, whether one is negated or not. Otherwise you could simply have combined [^\d] and \w.

Note: the ^ has to be at the start of the set, and applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched." Anywhere else in the set, ^ is a literal caret, so your set [\w^\d] matches a single character that is a word character, a caret, or a digit. Since \d is already covered by \w, that is the same as [\w^].
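
A quick way to see how re treats a caret that is not the first character of a set is to run the sets from the question against a short string. This is a sketch assuming Python 3; the sample string is made up:

```python
import re

sample = "a^1(,"  # made-up: a letter, a caret, a digit, two punctuation marks

# Caret in the middle of the set is a literal character, not a negation.
middle_caret = re.findall(r"[\w^\d]", sample)
print(middle_caret)  # ['a', '^', '1']

# With \W instead of \w the set picks up punctuation as well.
upper_w = re.findall(r"[\W^\d]", sample)
print(upper_w)  # ['^', '1', '(', ',']

# Caret first negates the whole set: neither non-word nor digit.
negated = re.findall(r"[^\W\d]", sample)
print(negated)  # ['a']
```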

I would do it in two steps, effectively combining the regular expressions. First match by non-digits (inner regex), then match by alpha-numerical characters:

re.search(r'\w+', re.search(r'([^\d]+)', s).group(0)).group(0)

or variations on this theme.

Note that you would need to wrap this in a try/except block, as it raises AttributeError: 'NoneType' object has no attribute 'group' if either of the two regexes fails to match. You can, of course, split this single line into a few more lines.
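
One way to split the one-liner up is sketched below, with explicit None checks in place of the try/except. The helper name first_word_without_digits is hypothetical, and Python 3 is assumed:

```python
import re

def first_word_without_digits(s):
    """Hypothetical helper: two-step match, returning None when either step fails."""
    # Step 1: first run of non-digit characters.
    outer = re.search(r"[^\d]+", s)
    if outer is None:
        return None
    # Step 2: first run of word characters inside that run.
    inner = re.search(r"\w+", outer.group(0))
    return inner.group(0) if inner else None

print(first_word_without_digits("12bac15"))   # bac
print(first_word_without_digits("Æble, 12"))  # Æble
print(first_word_without_digits("123"))       # None
```

Because only the first non-digit run is inspected, an input such as "  12ab" returns None: the first run is whitespace and contains no word characters.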