Categories Visible from the Shell
The code below shows all of the "CATEGORIES". The ones marked "IN" are character categories (the others, marked "AT", mark specific slice points between characters):
>>> from pprint import pprint
>>> import sre_parse
>>> pprint(sre_parse.CATEGORIES)
{'\\A': (AT, AT_BEGINNING_STRING),
'\\B': (AT, AT_NON_BOUNDARY),
'\\D': (IN, [(CATEGORY, CATEGORY_NOT_DIGIT)]),
'\\S': (IN, [(CATEGORY, CATEGORY_NOT_SPACE)]),
'\\W': (IN, [(CATEGORY, CATEGORY_NOT_WORD)]),
'\\Z': (AT, AT_END_STRING),
'\\b': (AT, AT_BOUNDARY),
'\\d': (IN, [(CATEGORY, CATEGORY_DIGIT)]),
'\\s': (IN, [(CATEGORY, CATEGORY_SPACE)]),
'\\w': (IN, [(CATEGORY, CATEGORY_WORD)])}
The entries with "CATEGORY" are the character categories.
This also answers the question of what \w stands for: it is a "word character". See also: In regex, what does \w* mean?
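To make the "word character" idea concrete, here is a minimal sketch using only the public re module. The sample string and variable names are made up for illustration:

```python
import re

# Hypothetical sample text, chosen only to mix letters, digits,
# underscores, and punctuation.
text = "snake_case var2 = 3.14!"

# \w matches one word character: a letter, a digit, or an underscore.
words = re.findall(r"\w+", text)
print(words)      # ['snake_case', 'var2', '3', '14']

# \W is the complement: any run of characters that are not word characters.
non_words = re.findall(r"\W+", text)
print(non_words)  # [' ', ' = ', '.', '!']
```

Note that the underscore counts as a word character, while the decimal point does not, which is why 3.14 is split in two.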
Categories Explained in the Docs
This is in the output of print(re.__doc__). It explains the intended meaning of each category:
The special sequences consist of "\\" and a character from the list
below. If the ordinary character is not on the list, then the
resulting RE will match the second character.
\number Matches the contents of the group of the same number.
\A Matches only at the start of the string.
\Z Matches only at the end of the string.
\b Matches the empty string, but only at the start or end of a word.
\B Matches the empty string, but not at the start or end of a word.
\d Matches any decimal digit; equivalent to the set [0-9] in
bytes patterns or string patterns with the ASCII flag.
In string patterns without the ASCII flag, it will match the whole
range of Unicode digits.
\D Matches any non-digit character; equivalent to [^\d].
\s Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
bytes patterns or string patterns with the ASCII flag.
In string patterns without the ASCII flag, it will match the whole
range of Unicode whitespace characters.
\S Matches any non-whitespace character; equivalent to [^\s].
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
in bytes patterns or string patterns with the ASCII flag.
In string patterns without the ASCII flag, it will match the
range of Unicode alphanumeric characters (letters plus digits
plus underscore).
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W Matches the complement of \w.
\\ Matches a literal backslash.
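A few of the behaviors described in that docstring can be checked directly. The following is a small sketch, not an exhaustive test. The sample strings are invented, and the Arabic-Indic digits are written as \u escapes so the example does not depend on the source encoding:

```python
import re

# "42" in ASCII digits, followed by "42" in Arabic-Indic digits.
sample = "42 and \u0664\u0662"

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digits = re.findall(r"\d+", sample)

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.findall(r"\d+", sample, re.ASCII)

# \b matches the empty string at a word boundary, so the "cat"
# inside "concat" is not reported.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \A anchors to the start of the whole string even with re.MULTILINE,
# whereas ^ then also matches at the start of every line.
string_start = re.findall(r"\A\w+", "first\nsecond", re.MULTILINE)
line_starts = re.findall(r"^\w+", "first\nsecond", re.MULTILINE)

print(unicode_digits, ascii_digits, whole_words, string_start, line_starts)
```

The contrast between \A and ^ is the practical reason both exist: \A and \Z are the "AT" entries in CATEGORIES, and they ignore the MULTILINE flag.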
Other Special Character Groups
Besides the shorthand character classes, the sre_parse module defines other interesting character groups as well:
SPECIAL_CHARS = ".\\[{()*+?^$|"
REPEAT_CHARS = "*+?{"
DIGITS = frozenset("0123456789")
OCTDIGITS = frozenset("01234567")
HEXDIGITS = frozenset("0123456789abcdefABCDEF")
ASCIILETTERS = frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
WHITESPACE = frozenset(" \t\n\r\v\f")
ESCAPES = {
r"\a": (LITERAL, ord("\a")),
r"\b": (LITERAL, ord("\b")),
r"\f": (LITERAL, ord("\f")),
r"\n": (LITERAL, ord("\n")),
r"\r": (LITERAL, ord("\r")),
r"\t": (LITERAL, ord("\t")),
r"\v": (LITERAL, ord("\v")),
r"\\": (LITERAL, ord("\\"))
}
FLAGS = {
# standard flags
"i": SRE_FLAG_IGNORECASE,
"L": SRE_FLAG_LOCALE,
"m": SRE_FLAG_MULTILINE,
"s": SRE_FLAG_DOTALL,
"x": SRE_FLAG_VERBOSE,
# extensions
"a": SRE_FLAG_ASCII,
"t": SRE_FLAG_TEMPLATE,
"u": SRE_FLAG_UNICODE,
}
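The single letters in FLAGS are the ones accepted by the inline (?...) syntax, and ESCAPES explains why \b can also mean a backspace. Here is a hedged sketch of both, using only the public re module. The patterns and inputs are illustrative, not taken from the module source:

```python
import re

# (?i) is the inline spelling of re.IGNORECASE.
inline_ignorecase = re.fullmatch(r"(?i)python", "PyThOn") is not None
flag_ignorecase = re.fullmatch(r"python", "PyThOn", re.IGNORECASE) is not None

# (?s) is re.DOTALL: "." then also matches a newline.
with_dotall = re.match(r"(?s)a.b", "a\nb") is not None
without_dotall = re.match(r"a.b", "a\nb") is not None

# (?x) is re.VERBOSE: whitespace and comments in the pattern are ignored.
verbose_match = re.fullmatch(r"""(?x)
    \d{3}   # three digits
    -       # a literal hyphen
    \d{4}   # four digits
""", "555-1234") is not None

# Inline flags end up in the compiled pattern's .flags attribute.
pattern = re.compile(r"(?im)^total")
has_both = bool(pattern.flags & re.IGNORECASE) and bool(pattern.flags & re.MULTILINE)

# Inside a character class, \b is the backspace escape rather than
# the word-boundary assertion.
backspace = re.search(r"[\b]", "a\bc")

print(inline_ignorecase, flag_ignorecase, with_dotall, without_dotall,
      verbose_match, has_both, backspace.group() == "\b")
```

Inline flags are handy when the pattern is passed through an API that does not expose a flags argument. Otherwise the named constants such as re.IGNORECASE are usually easier to read.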