The re.A flag only affects what the shorthand character classes and the \b / \B word boundaries match.
In Python 3.x, shorthand character classes are Unicode-aware by default, i.e. the behavior enabled by re.UNICODE / re.U in Python 2.x is ON by default. That means:
\d - Matches any Unicode decimal digit (that is, any character in the Unicode character category [Nd]).
\D - Matches any character that is not a decimal digit (so, all characters other than those in the Nd Unicode category).
\w - Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. (So, \w+ matches each word in the My name is Виктор string.)
\W - Matches any character that is not a word character. This is the opposite of \w. (So, it will not match any Unicode letter, digit or underscore.)
\s - Matches Unicode whitespace characters (it will match NEL (U+0085), hard (non-breaking) spaces (U+00A0), etc.)
\S - Matches any character that is not a whitespace character. (So, no match for NEL, hard space, etc.)
\b - Word boundaries match locations between a Unicode word character (letter, digit or underscore) and a non-word character, or between a word character and the start/end of the string.
\B - Non-word boundaries match all other locations: between two Unicode word characters, between two non-word characters, or between a non-word character and the start/end of the string.
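The default, Unicode-aware behavior listed above can be checked with a short sketch. The sample strings and variable names are illustrative assumptions; \u0967 and \u09e7 are the Devanagari and Bengali digit one.

```python
import re

# No flags: a str pattern in Python 3 is Unicode-aware by default.

# \w+ matches Latin and Cyrillic words alike.
words = re.findall(r"\w+", "My name is Виктор")
print(words)  # ['My', 'name', 'is', 'Виктор']

# \d matches any character in the Nd category, not just 0-9.
digits = re.findall(r"\d", "1 \u0967 \u09e7")
print(len(digits))  # 3

# \s matches NEL (U+0085) and the hard space (U+00A0) as well as " ".
spaces = re.findall(r"\s", "a\x85b\xa0c d")
print(len(spaces))  # 3

# \b recognizes a boundary around a Cyrillic word.
bounded = re.findall(r"\bВиктор\b", "Это Виктор.")
print(bounded)  # ['Виктор']
```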
If you want to disable this behavior, use re.A or re.ASCII:
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).
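As a minimal sketch of the quoted description, the three spellings of the flag can be compared side by side, together with a bytes pattern where the flag makes no difference. The sample text and variable names are assumptions made for illustration.

```python
import re

text = "Wiktor Виктор"

# Three equivalent ways to request ASCII-only matching for a str pattern.
by_short_flag = re.findall(r"\w+", text, flags=re.A)
by_long_flag = re.findall(r"\w+", text, flags=re.ASCII)
by_inline_flag = re.findall(r"(?a)\w+", text)
print(by_short_flag, by_long_flag, by_inline_flag)  # ['Wiktor'] three times

# Bytes patterns match ASCII-only either way, so the flag changes nothing.
data = text.encode("utf-8")
bytes_plain = re.findall(rb"\w+", data)
bytes_flagged = re.findall(rb"\w+", data, flags=re.A)
print(bytes_plain == bytes_flagged)  # True
```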
That means that:
\d = [0-9] - and no longer matches Hindi, Bengali, etc. digits
\D = [^0-9] - and matches any character other than an ASCII digit (i.e. it now also matches the Hindi, Bengali, etc. digits that \d no longer matches)
\w = [A-Za-z0-9_] - and it only matches ASCII words now: Wiktor is matched with \w+, but Виктор is not
\W = [^A-Za-z0-9_] - it matches any char but ASCII letters/digits/_ (i.e. it matches 你好吗, Виктор, etc.)
\s = [ \t\n\r\f\v] - matches a regular space, tab, linefeed, carriage return, form feed and a vertical tab
\S = [^ \t\n\r\f\v] - matches any char other than a space, tab, linefeed, carriage return, form feed and a vertical tab, so it matches all Unicode letters, digits and punctuation as well as Unicode (non-ASCII) whitespace. E.g., re.sub(r'\S+', r'{\g<0>}', '\xA0 ', flags=re.A) will return '{\xa0} ': as you see, \S now matches hard spaces.
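The differences above can be seen by running the same pattern with and without re.A. This is only a sketch: the helper compare() is a hypothetical name introduced here, and the sample strings are assumptions (\u0967 is the Devanagari digit one).

```python
import re

def compare(pattern, text):
    """Return (default Unicode matches, re.A matches) for one pattern."""
    return re.findall(pattern, text), re.findall(pattern, text, flags=re.A)

# \d: the Devanagari digit is matched only in Unicode mode.
d_uni, d_ascii = compare(r"\d", "1\u0967")

# \D: with re.A the Devanagari digit counts as a non-digit.
nd_uni, nd_ascii = compare(r"\D", "1\u0967")

# \w / \W: the Cyrillic word switches from word to non-word characters.
w_uni, w_ascii = compare(r"\w+", "Wiktor Виктор")
nw_uni, nw_ascii = compare(r"\W+", "Виктор")

# \b: with re.A there is no word boundary around Cyrillic letters.
b_uni, b_ascii = compare(r"\bВиктор\b", "Это Виктор.")

# \S: with re.A the hard space (U+00A0) counts as non-whitespace.
wrapped = re.sub(r"\S+", r"{\g<0>}", "\xa0 ", flags=re.A)

print(w_uni, w_ascii)  # ['Wiktor', 'Виктор'] ['Wiktor']
print(repr(wrapped))   # '{\xa0} '
```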