Complete set of punctuation marks for Python (not just ASCII)

Question

Is there a list or library that covers all the punctuation characters we might commonly come across?

Normally I use string.punctuation, but some punctuation characters are not included in it, for example:

>>> import string
>>> "'" in string.punctuation
True
>>> "’" in string.punctuation
False

Does this answer your question? [Best way to strip punctuation from a string](https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string) — airstrike, Apr 02 '20 at 03:39

wim · Accepted Answer · 2020-04-03T18:59:48.167

You might do better with this check:

>>> import unicodedata
>>> unicodedata.category("'").startswith("P")
True
>>> unicodedata.category("’").startswith("P")
True

The Unicode categories P* are specifically for Punctuation:

connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)
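To make the subcategories above concrete, here is a small sketch that prints the category of a few sample characters. The helper name `is_punctuation` and the choice of samples are illustrative assumptions, not part of the original answer.

```python
import unicodedata

def is_punctuation(ch):
    """Return True if the single character ch is in any Unicode P* category."""
    return unicodedata.category(ch).startswith("P")

# One sample per punctuation subcategory, plus two non-punctuation characters.
samples = ["_", "-", "«", "»", "(", ")", "!", "$", "a"]
for ch in samples:
    print(repr(ch), unicodedata.category(ch), is_punctuation(ch))
```

Note that `$` comes out as Sc (a Symbol category) and `a` as Ll, so neither counts as punctuation under this check.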

To prepare the exhaustive collection, which you can subsequently use for fast membership checks, use a set comprehension:

>>> import sys
>>> from unicodedata import category
>>> codepoints = range(sys.maxunicode + 1)
>>> punctuation = {c for i in codepoints if category(c := chr(i)).startswith("P")}
>>> "'" in punctuation
True
>>> "’" in punctuation
True

The assignment expression (:=) here requires Python 3.8+. An equivalent for older Python versions:

chrs = (chr(i) for i in range(sys.maxunicode + 1))
punctuation = set(c for c in chrs if category(c).startswith("P"))

Beware that some of the characters in string.punctuation are actually in the Unicode category Symbol (S*), not Punctuation. It's easy to add those as well if you want.
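As a rough sketch of what adding the Symbol categories could look like, the check below accepts any category starting with "P" or "S" and lists which ASCII characters from string.punctuation are classified as symbols. The function name `is_punct_or_symbol` is a hypothetical helper introduced for illustration; whether to include all S* categories depends on your use case.

```python
import string
from unicodedata import category

def is_punct_or_symbol(ch):
    """True for Unicode Punctuation (P*) and Symbol (S*) characters."""
    return category(ch)[0] in ("P", "S")

# ASCII "punctuation" that Unicode actually classifies as Symbol.
symbols_in_ascii = [c for c in string.punctuation if category(c).startswith("S")]
print(symbols_in_ascii)

# With both major categories accepted, all of string.punctuation is covered.
print(all(is_punct_or_symbol(c) for c in string.punctuation))
```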

A reasonable definition of “punctuation” would include the Unicode “Symbol” categories Sc (currency, like `$`), Sk (modifier, like `^`), Sm (math, like `+` or `<`), and maybe So (other, like `©`). — dan04, Apr 02 '20 at 19:05
@dan04 That's what the last paragraph of the answer is referring to. Of course, others can adapt this code to include or exclude categories depending on their own use case. – wim, Apr 02 '20 at 19:13

Selcuk · Answer 2 · 2020-04-21T23:33:23.940

18

The answer posted by wim is correct if you want to check if a character is a punctuation character.

If you really need a list of all punctuation characters as your question title suggests, you can use the following:

import sys
from unicodedata import category
# range() excludes its stop value, so add 1 to cover the last code point
punctuation_chars = [chr(i) for i in range(sys.maxunicode + 1)
                     if category(chr(i)).startswith("P")]
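If the end goal is to strip punctuation from text, one possible use of such a listing is a translation table for str.translate. This is only a sketch: the names `strip_table` and `strip_punctuation` are made up for illustration, and the table is built once because scanning every code point takes a noticeable fraction of a second.

```python
import sys
from unicodedata import category

# Map every punctuation code point to None so that str.translate() deletes it.
strip_table = {
    i: None
    for i in range(sys.maxunicode + 1)
    if category(chr(i)).startswith("P")
}

def strip_punctuation(text):
    """Return text with all Unicode P* characters removed."""
    return text.translate(strip_table)

print(strip_punctuation("It’s “quoted”, isn’t it?"))  # Its quoted isnt it
```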

answered Apr 02 '20 at 03:39 by Selcuk · edited Apr 21 '20 at 23:33

score 2 · Answer 3 · edited Apr 03 '20 at 13:36


The answer by wim is great if you can change your code to use a function.

But if you have to use the in operator (for example, you're calling into library code), you can use duck typing:

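import unicodedata

class DuckType:
    # Supports the `in` operator without storing any characters
    def __contains__(self, s):
        # category() expects a single character
        return unicodedata.category(s).startswith("P")

punct = DuckType()
# print("'" in punct, '"' in punct, "a" in punct)  # True True False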
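One caveat with this approach: unicodedata.category() raises TypeError for anything that is not a single character, and library code may well pass longer or empty strings to `in`. A slightly more defensive variant might look like the sketch below; the class name `PunctuationSet` and the choice to return False for such inputs are assumptions, not part of the original answer.

```python
import unicodedata

class PunctuationSet:
    """Hypothetical container-like object; answers `in` without storing characters."""

    def __contains__(self, item):
        # unicodedata.category() only accepts a single character, so any
        # other input is simply reported as "not punctuation".
        if not isinstance(item, str) or len(item) != 1:
            return False
        return unicodedata.category(item).startswith("P")

punct = PunctuationSet()
print("’" in punct)   # True
print("ab" in punct)  # False
print("" in punct)    # False
```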

answered Apr 02 '20 at 03:40 by xkcdjerry · edited Apr 03 '20 at 13:36 by Peter Mortensen

score 1 · Answer 4 · edited Apr 03 '20 at 13:38


That seems like a good job for a regular expression (regexp):

    import re
    text = re.sub(r"[^\w\s]", "", str(text), flags=re.UNICODE)

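Here, the regexp matches every character that is neither whitespace nor a word character. The re.UNICODE flag makes \w and \s match the full set of Unicode characters; in Python 3 this is already the default for str patterns, so the flag is only needed in Python 2.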
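A short sketch of how this pattern behaves in Python 3 may help when judging whether it fits your use case. The function name `strip_non_word` is illustrative. Two things are worth noticing: the underscore counts as a word character and is kept, and symbols such as `$` are removed even though they are not in a Unicode Punctuation category.

```python
import re

def strip_non_word(text):
    """Remove every character that is neither a word character nor whitespace."""
    # In Python 3, \w and \s are Unicode-aware by default for str patterns.
    return re.sub(r"[^\w\s]", "", text)

print(strip_non_word("Den som dræber - fanget"))  # 'æ' survives, '-' is removed
print(strip_non_word("snake_case, isn’t it?"))    # '_' is kept
print(strip_non_word("costs $5"))                 # '$' is removed as well
```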

answered Apr 02 '20 at 03:43 by Nicolas Martinez · edited Apr 03 '20 at 13:38 by Peter Mortensen

doesn't work with many languages: `>>> text="Den som dræber - fanget" >>> re.sub(r"[^\w\s]", "", str(text), flags=re.UNICODE) 'Den som dr\xc3ber fanget'` – samuelbrody1249 Apr 02 '20 at 04:38

@samuelbrody1249 What do you mean it doesn't work? It does work in your example (the `\xc3` escape is a representation thing unrelated to the stripping of punctuation). – lenz Apr 02 '20 at 07:28

@lenz `\xc3` is not the correct Unicode encoding of `æ`; if you type `str(text)` you can confirm that it is `\xc3\xa6`. Actually `\xc3` does not seem to be a complete codepoint. – Federico Poloni Apr 02 '20 at 13:50

Oh I see. It seems you both are using Python 2, where `str` is a byte string. You should definitely switch to Python 3, because Unicode is a nightmare in Py2. For me, `str('æ')` shows as `'æ'`, and `ascii('æ')` shows as `'\xe6'`, which is the correct codepoint. `b'\xc3\xa6'` is the UTF-8 encoding of `'æ'`, but this isn't usually what you want to work with. – lenz Apr 02 '20 at 18:55

score 0 · Answer 5 · answered Apr 07 '20 at 20:54

As other answers have pointed out, the way to do this is via Unicode properties/categories. The accepted answer accesses this information via the standard library unicodedata module, but depending on the context where you need this, it might be faster or more convenient to access this same property information using regular expressions.

However, the standard library re module does not support Unicode property escapes such as \p{...}. For that, you need the third-party regex module, available on PyPI (pip install regex):

>>> import regex as re
>>> re.match(r"\p{Punctuation}", "'")
<regex.Match object; span=(0, 1), match="'">
>>> re.match(r"\p{Punctuation}", "’")
<regex.Match object; span=(0, 1), match='’'>

A good overview of all the different kinds of Unicode properties you can search for using regular expressions is provided here. Apart from these extra regular expression features, which are documented on its PyPI homepage, regex deliberately provides the same API as re, so you're expected to use re's documentation to figure out how to use either of them.
