Can I mix character classes in Python RegEx?

Question

Special sequences (character classes) in Python RegEx are escapes like \w or \d that match a set of characters.

In my case, I need to be able to match all alpha-numerical characters except numbers.

That is, \w minus \d.

I need to use the special sequence \w because I'm dealing with non-ASCII characters and need to match symbols like "Æ" and "Ø".

One would think I could use this expression: [\w^\d] but it doesn't seem to match anything and I'm not sure why.

So in short, how can I mix (add/subtract) special sequences in Python Regular Expressions?

EDIT: I accidentally used [\W^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas which are not alpha-numerical characters as far as I'm concerned.

Your expression matches letters, digits, and ^, I think. ^ for negating a class should be placed at the beginning of the class definition. – njzk2 Sep 10 '12 at 12:02

score 15 · Accepted Answer · answered Sep 10 '12 at 10:11

You can use r"[^\W\d]", i.e. invert the union of non-alphanumerics and digits.
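As a quick illustration of the negated-union trick, here is a minimal sketch. The sample string and the variable names are made up for the example, and it assumes Python 3, where str patterns are Unicode-aware by default.

```python
import re

# [^\W\d] reads as "not (non-word or digit)", i.e. word characters minus digits.
# In Python 3, str patterns are Unicode-aware, so \w covers letters like Æ and Ø.
NON_DIGIT_WORD = re.compile(r"[^\W\d]+")

sample = "Æble 123 Øl_42"  # hypothetical sample text
words = NON_DIGIT_WORD.findall(sample)
print(words)  # ['Æble', 'Øl_']
```

Note that the underscore survives, because \w includes it; add it to the negated set (r"[^\W\d_]") if you want letters only.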

Janne Karila


Note that you need to set `re.UNICODE` for this to match `æ` and other non-ASCII characters. The OP probably already does this, but it bears stating. – Martijn Pieters Sep 10 '12 at 10:25
In this case, how do I add specific characters to the character class, e.g. spaces or commas? – Hubro Sep 10 '12 at 11:18
@Codemonkey You can use a non-capturing group and `|`: `(?:[^\W\d]|[, ])` – Janne Karila Sep 10 '12 at 11:23
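The alternation suggested in the last comment can be tried out like this. It is only a sketch; the input string is invented, and the trailing + is added here so that whole runs are returned instead of single characters.

```python
import re

# Non-digit word characters, or a literal comma or space, repeated.
# The group is non-capturing, so findall returns the whole match.
pattern = re.compile(r"(?:[^\W\d]|[, ])+")

sample = "Æble, Øl 42 kr"  # hypothetical input
runs = pattern.findall(sample)
print(runs)  # ['Æble, Øl ', ' kr']
```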

Martijn Pieters · Answer 2 · 2022-05-30T12:14:57.073

9

You cannot subtract character classes with the standard re module, no.

Your best bet is to use the regex project, which offers additional functionality while remaining backwards compatible with the re module in the standard library. It supports character classes based on Unicode properties:

\p{IsAlphabetic}

This will match any character that the Unicode specification states is an alphabetic character.

Even better, regex does support character class subtraction; it views such classes as sets and allows you to create a difference with the -- operator:

[\w--\d]

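matches everything in \w except anything that also matches \d. Set operators belong to the module's version 1 behaviour, so enable it with the regex.V1 flag or a (?V1) prefix.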
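A hedged sketch of the set-difference syntax follows. It assumes the third-party regex package is installed (pip install regex) and, to my understanding of its documentation, that set operators need version 1 behaviour. The helper name find_non_digit_words and the fallback to the standard library are choices made for this example, not part of either module.

```python
import re

try:
    import regex  # third-party package, not in the standard library
except ImportError:
    regex = None


def find_non_digit_words(text):
    """Return runs of word characters that contain no digits."""
    if regex is not None:
        # regex.V1 switches on version 1 behaviour, which enables
        # set operators such as -- (difference) inside [...].
        return regex.findall(r"[\w--\d]+", text, flags=regex.V1)
    # Fallback with the standard library: negate the union instead.
    return re.findall(r"[^\W\d]+", text)


print(find_non_digit_words("Æble 123 Øl"))  # ['Æble', 'Øl']
```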

edited May 30 '22 at 12:14

answered Sep 10 '12 at 09:48


Is the `regex` package still set to replace the built-in `re` module? The built-in module is powerful enough for most basic tasks while still being relatively simple to use. The `regex` module is indeed incomparably powerful for all kinds of tasks but the power comes with a price. I hope that such specialized tools will stay independent. For everyone's benefit. – Jeyekomon May 30 '22 at 12:08
@Jeyekomon: No, in the interim the [project has dropped the intention to replace the `re` module](https://github.com/mrabarnett/mrab-regex/commit/47ec54e6ff40afabb3d647c7cd83d35c1d199086). I've updated the wording in my answer to match. – Martijn Pieters May 30 '22 at 12:13

user4815162342 · Answer 3 · 2012-09-10T17:13:48.527

2

You can exclude classes using a negative lookahead assertion, such as r'(?!\d)[\w]' to match a word character, excluding digits. For example:

>>> import re
>>> re.search(r'(?!\d)[\w]', '12bac')
<_sre.SRE_Match object at 0xb7779218>
>>> _.group(0)
'b'

To exclude more than one group, you can use the usual [...] syntax in the lookahead assertion, for example r'(?![0-5])[\w]' would match any alphanumeric character except for digits 0-5.
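For instance, the 0-5 exclusion could be checked like this (a small sketch with an invented test string):

```python
import re

# (?![0-5]) rejects positions where the next character is 0-5;
# \w then consumes one word character at any other position.
pattern = re.compile(r"(?![0-5])\w")

matches = pattern.findall("a1b6c9")
print(matches)  # ['a', 'b', '6', 'c', '9']
```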

As with [...], the above construct matches a single character. To match multiple characters, add a repetition operator:

>>> re.search(r'((?!\d)[\w])+', '12bac15')
<_sre.SRE_Match object at 0x7f44cd2588a0>
>>> _.group(0)
'bac'

edited Sep 10 '12 at 17:13

answered Sep 10 '12 at 10:08


This only works for one-letter combos; you'd have to group this in a larger group to work at all. – Martijn Pieters Sep 10 '12 at 10:14
Sure, but that's the case with [...] matching as well. I'll update the answer to state it explicitly. – user4815162342 Sep 10 '12 at 17:09

score 1 · Answer 4 · answered Sep 10 '12 at 10:06

I don't think you can directly combine (boolean and) character sets in a single regex, whether one is negated or not. Otherwise you could simply have combined [^\d] and \w.

Note: the ^ has to be at the start of the set, and applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched.". Anywhere else in the set the caret is a literal character, so your set [\w^\d] matches a single character that is a word character, a caret, or a digit. It does not subtract anything.
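How the caret is treated depending on its position can be checked directly. This is just a sketch; the sample string is made up.

```python
import re

sample = "a^1,"  # hypothetical input: letter, caret, digit, comma

# Caret in the middle: a literal ^, alongside \w and \d.
caret_in_middle = re.findall(r"[\w^\d]", sample)

# Caret first: negates the whole set (neither word character nor digit).
caret_first = re.findall(r"[^\w\d]", sample)

print(caret_in_middle)  # ['a', '^', '1']
print(caret_first)      # ['^', ',']
```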

I would do it in two steps, effectively combining the regular expressions. First match by non-digits (inner regex), then match by alpha-numerical characters:

re.search(r'\w+', re.search(r'([^\d]+)', s).group(0)).group(0)

or variations on this theme.

Note that you would need to surround this with a try: except: block, as it will throw an AttributeError: 'NoneType' object has no attribute 'group' in case one of the two regexes fails. But you can, of course, split this single line up into a few more lines.
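Split over a few lines, the two-step idea might look like the sketch below. The function name first_non_digit_word is made up for this example, and explicit None checks stand in for the try/except. Like the one-liner, it only looks at the first run of non-digits.

```python
import re


def first_non_digit_word(s):
    """Two-step match: first a run of non-digits, then word characters in it."""
    outer = re.search(r"[^\d]+", s)  # step 1: first run without digits
    if outer is None:
        return None
    inner = re.search(r"\w+", outer.group(0))  # step 2: word characters only
    return inner.group(0) if inner else None


print(first_non_digit_word("12 bac15"))  # bac
print(first_non_digit_word("123"))       # None
```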
