2

This Python doc gives a complete list of the metacharacters:

. ^ $ * + ? { } [ ] \ | ( )

Similarly, is there a page giving a complete list of character classes?

I assume "character classes" in that doc refers to a finite number of special sequences rather than to all possible Unicode characters. Please correct me if necessary.

I did search and didn't find the canonical term.

If "character classes" does refer to all possible Unicode characters, I would like to change my question to "a convenient way to look up regex special characters in Python".

It seems that regular-expressions.info calls these "Shorthand Character Classes".

More positive examples (that I am looking for) are \d, \s, \S, \A etc; negative examples (that I am not looking for) are abcdefghijklmnopqrstuvwxyz0123456789

I've searched for "character class" and "Shorthand Character Classes" in the Python docs and on Stack Overflow and didn't find what I want.

Why do I need this? When I read a section of the doc, such as

Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depends on whether ASCII or LOCALE mode is in force.

I would like to know what \w stands for. Searching either the doc or Google takes some time. For example, using Chrome's find command on that doc, \w gets 41 results.

If there were a list of those sequences, I could look up any of them in no more than two searches (lowercase and uppercase).

  • 2
    ...what? That's like asking for a complete list of all integers. Why are you asking this? – user2357112 Sep 29 '19 at 07:37
  • 1
    From the [Python documentation](https://docs.python.org/3/library/re.html), the characters matched by `\w` depend on the locale, so there is no one single answer to your question. – Tim Biegeleisen Sep 29 '19 at 07:38
  • @TimBiegeleisen: They depend on the locale if you're using bytestrings and specifically turn on the flag that says to use locale settings. That's an uncommon case. – user2357112 Sep 29 '19 at 07:44
  • @user2357112 I've updated the OP, would you please take a look at it? –  Sep 29 '19 at 07:50
  • 1
    The `re` docs don't make this clear, but the term "character class" includes things like `[abc]` as well as shorthand classes like `\w`. That particular doc section is only referring to shorthand character classes, though. (The `re` docs aren't great as a single source for learning about regexes.) – user2357112 Sep 29 '19 at 07:55
  • "More positive examples (that I am looking for) are `\d`, `\s`, `\S`, `\A`" - `\A` isn't a character class. – user2357112 Sep 29 '19 at 08:18
  • @user2357112 I am actually not sure `\d`, `\s`, `\S`, `\A` are character classes. So, what are they in the context of Python regex? –  Sep 29 '19 at 08:23
  • `\d`, `\s`, and `\S` are character classes. `\A` isn't a character class; it doesn't match a set of characters. It matches the beginning of the string. – user2357112 Sep 29 '19 at 08:28
  • 1
    Putting `\w` or `\S` in brackets will work, as described by the docs you quoted, but `\A` won't, because it's not a character class. – user2357112 Sep 29 '19 at 08:30

3 Answers

3

Categories Visible from the Shell

The code below shows all of the "CATEGORIES". The ones marked "IN" are character categories (the others mark specific slice points between characters):

>>> from pprint import pprint
>>> import sre_parse

>>> pprint(sre_parse.CATEGORIES)
{'\\A': (AT, AT_BEGINNING_STRING),
 '\\B': (AT, AT_NON_BOUNDARY),
 '\\D': (IN, [(CATEGORY, CATEGORY_NOT_DIGIT)]),
 '\\S': (IN, [(CATEGORY, CATEGORY_NOT_SPACE)]),
 '\\W': (IN, [(CATEGORY, CATEGORY_NOT_WORD)]),
 '\\Z': (AT, AT_END_STRING),
 '\\b': (AT, AT_BOUNDARY),
 '\\d': (IN, [(CATEGORY, CATEGORY_DIGIT)]),
 '\\s': (IN, [(CATEGORY, CATEGORY_SPACE)]),
 '\\w': (IN, [(CATEGORY, CATEGORY_WORD)])}

The entries with "CATEGORY" are the character categories.

This also answers the question of what \w stands for. It is a "word character". See also: In regex, what does \w* mean?
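As a minimal sketch of the listing above, the shorthand classes can be pulled out programmatically by keeping only the entries tagged `IN`. This leans on private, undocumented internals: it assumes the parser lives in `re._parser` on Python 3.11+ (where `sre_parse` is deprecated) and in `sre_parse` before that. The names `parser` and `shorthand_classes` are illustrative only.

```python
# Sketch only: relies on private internals of the re module.
try:
    from re import _parser as parser  # Python 3.11+
except ImportError:
    import sre_parse as parser        # older Pythons

# Keep the escapes whose entry is tagged IN, i.e. the ones that match a
# set of characters rather than a position in the string.
shorthand_classes = sorted(
    escape
    for escape, (kind, _) in parser.CATEGORIES.items()
    if kind == parser.IN
)
print(shorthand_classes)
```

Because this depends on implementation details, it is better suited to exploring at the prompt than to production code.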

Categories Explained in the Docs

This is in the output of print(re.__doc__). It explains the intended meaning of each category:

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.
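The effect of the ASCII flag described in that docstring can be checked with a short experiment. This is only an illustration; the sample string and variable names are made up for the example.

```python
import re

# U+0663 is ARABIC-INDIC DIGIT THREE, U+00E9 is "e" with an acute accent.
text = "x\u0663_\u00e9 7"

# Default for str patterns: Unicode-aware matching.
unicode_digits = re.findall(r"\d", text)
unicode_word = re.findall(r"\w", text)

# With re.ASCII, \d means [0-9] and \w means [a-zA-Z0-9_].
ascii_digits = re.findall(r"\d", text, re.ASCII)
ascii_word = re.findall(r"\w", text, re.ASCII)

print(unicode_digits, ascii_digits)
print(unicode_word, ascii_word)
```

Without the flag, both the Arabic-Indic digit and the accented letter are matched; with it, only the ASCII characters are.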

Other Special Character Groups

Besides the short-hand character classes, the sre_parse module details other interesting character groups as well:

SPECIAL_CHARS = ".\\[{()*+?^$|"
REPEAT_CHARS = "*+?{"
DIGITS = frozenset("0123456789")
OCTDIGITS = frozenset("01234567")
HEXDIGITS = frozenset("0123456789abcdefABCDEF")
ASCIILETTERS = frozenset("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
WHITESPACE = frozenset(" \t\n\r\v\f")

ESCAPES = {
    r"\a": (LITERAL, ord("\a")),
    r"\b": (LITERAL, ord("\b")),
    r"\f": (LITERAL, ord("\f")),
    r"\n": (LITERAL, ord("\n")),
    r"\r": (LITERAL, ord("\r")),
    r"\t": (LITERAL, ord("\t")),
    r"\v": (LITERAL, ord("\v")),
    r"\\": (LITERAL, ord("\\"))
}

FLAGS = {
    # standard flags
    "i": SRE_FLAG_IGNORECASE,
    "L": SRE_FLAG_LOCALE,
    "m": SRE_FLAG_MULTILINE,
    "s": SRE_FLAG_DOTALL,
    "x": SRE_FLAG_VERBOSE,
    # extensions
    "a": SRE_FLAG_ASCII,
    "t": SRE_FLAG_TEMPLATE,
    "u": SRE_FLAG_UNICODE,
}
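One way to see what `SPECIAL_CHARS` is about: each of those characters is regex syntax on its own, and `re.escape` turns it into a pattern that matches the character literally. A small sketch, with the string copied from the listing above rather than imported from the private module:

```python
import re

# Copied from the listing above, not imported from the private module.
SPECIAL_CHARS = ".\\[{()*+?^$|"

for char in SPECIAL_CHARS:
    escaped = re.escape(char)
    # The escaped form is a backslash followed by the character itself,
    # and it matches that one character literally.
    assert escaped == "\\" + char
    assert re.fullmatch(escaped, char)

escaped_all = re.escape(SPECIAL_CHARS)
print(escaped_all)
```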
Raymond Hettinger
1

It looks like you're looking for all shorthand character classes Python's re module supports. Things like [abc] also fall under the name "character class", although this might not be obvious from the re docs, and it would be impossible and pointless to try to make a complete list of those.

A character class is regex syntax for matching a single character, usually by specifying that it belongs or doesn't belong to some set of characters. Syntax like [abc] lets you explicitly specify a set of characters to match, while shorthand character classes like \d are shorthand for large, predefined sets of characters.

Python's re module supports 6 shorthand character classes: \d, which matches digits, \s, which matches whitespace, \w, which matches "word" characters, and \D, \S, and \W, which match any character \d, \s, and \w don't match. Exactly which characters count or don't count depends on whether you're using Unicode strings or bytestrings and whether the ASCII or LOCALE flags are set; see the re docs for further details (and expect disappointment with the vague docs for \w).
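A quick way to convince yourself of the lowercase/uppercase pairing is to run each pair over the same string. The sample string here is arbitrary:

```python
import re

sample = "a1 _\t-"

# Each uppercase class matches exactly the characters its lowercase
# partner rejects, so together they cover the whole string.
for lower, upper in [(r"\d", r"\D"), (r"\s", r"\S"), (r"\w", r"\W")]:
    hits = re.findall(lower, sample)
    misses = re.findall(upper, sample)
    print(lower, hits, upper, misses)

# Shorthand classes can also be combined inside brackets.
digits_or_space = re.findall(r"[\d\s]", sample)
print(digits_or_space)
```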

There are plenty of other backslash-letter sequences with special meaning, but they're not character classes. For example, \b matches a word boundary (or if you forgot to use raw strings, it gets interpreted as a backspace character before the regex engine gets to see it), but that's not a character class.
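The raw-string pitfall and the "not a character class" distinction can both be seen in a few lines. This is just a sketch with made-up sample text:

```python
import re

text = "cat concat cat."

# Raw string: the regex engine sees \b and treats it as a word boundary.
raw = re.findall(r"\bcat\b", text)

# Plain string: Python turns "\b" into a backspace before re sees it,
# so the pattern looks for backspace + "cat" + backspace.
not_raw = re.findall("\bcat\b", text)

# Inside brackets, \b means a literal backspace character.
in_set = re.findall(r"[\b]", "a\bb")

# \A is an anchor, not a character class, so it is rejected in brackets.
try:
    re.compile(r"[\A]")
    rejected = False
except re.error:
    rejected = True

print(raw, not_raw, in_set, rejected)
```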

Other regex implementations may support different shorthand character classes, and their shorthand character classes may match different characters. For example, Perl has way more of these, and Perl's \w matches more characters than Python's, like combining diacritics.

user2357112
  • I guess you and I were typing/pasting "shorthand character classes" at the same time :-) "All shorthand character classes Python's re module supports", that is exactly what I am looking for! You are nice! –  Sep 29 '19 at 08:17
0

Are you looking for string.printable, or perhaps ''.join(filter(lambda x: not x.isalnum(), string.printable)), which returns

!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c

?

Shay Nehmad