41

Is there a listing or library that has all the punctuation characters we might commonly come across?

Normally I use string.punctuation, but some punctuation characters are not included in it, for example:

>>> "'" in string.punctuation
True
>>> "’" in string.punctuation
False
Peter Mortensen
samuelbrody1249

5 Answers

62

You might do better with this check:

>>> import unicodedata
>>> unicodedata.category("'").startswith("P")
True
>>> unicodedata.category("’").startswith("P")
True

The Unicode categories P* are specifically for Punctuation:

connector (Pc), dash (Pd), initial quote (Pi), final quote (Pf), open (Ps), close (Pe), other (Po)
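As a quick illustration of those subcategories, here is a minimal sketch that looks up one sample character per category. The sample characters are chosen here for illustration and are not part of the original answer:

```python
import unicodedata

# One illustrative character per punctuation subcategory (samples are an assumption).
samples = ["_", "-", "«", "»", "(", ")", "!"]

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    print(repr(ch), cat)
```

Each of these should report a category beginning with "P", which is what the startswith("P") check relies on.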

To prepare the exhaustive collection, which you can subsequently use for fast membership checks, use a set comprehension:

>>> import sys
>>> from unicodedata import category
>>> codepoints = range(sys.maxunicode + 1)
>>> punctuation = {c for i in codepoints if category(c := chr(i)).startswith("P")}
>>> "'" in punctuation
True
>>> "’" in punctuation
True

The assignment expression here requires Python 3.8+. An equivalent for older Python versions:

chrs = (chr(i) for i in range(sys.maxunicode + 1))
punctuation = set(c for c in chrs if category(c).startswith("P"))

Beware that some of the characters in string.punctuation (for example $, +, < and ^) are actually in the Unicode Symbol categories (S*), not Punctuation. It's easy to add those as well if you want.
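One possible way to add the Symbol categories, sketched under the assumption that "punctuation" should mean every P* and S* category. The names WANTED, punct_and_symbols and strip_table are made up for this example:

```python
import string
import sys
from unicodedata import category

# Assumption: "punctuation" means Unicode categories P* plus S*.
# Narrow or widen this tuple to suit your own definition.
WANTED = ("P", "S")

# str.startswith accepts a tuple of prefixes.
punct_and_symbols = {
    c for c in map(chr, range(sys.maxunicode + 1))
    if category(c).startswith(WANTED)
}

# Characters of string.punctuation that the set does not cover (expected: none).
missing = set(string.punctuation) - punct_and_symbols

# Translation table that maps every collected character to None, i.e. deletes it.
strip_table = dict.fromkeys(map(ord, punct_and_symbols))
cleaned = "“Hello”, world… it’s $5 + tax!".translate(strip_table)
print(cleaned)
```

Using str.translate with a prebuilt table is one reasonable choice for bulk stripping; a plain membership test against the set works just as well for checking single characters.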

wim
  • A reasonable definition of “punctuation” would include the Unicode “Symbol” categories Sc (currency, like `$`), Sk (modifier, like `^`), Sm (math, like `+` or `<`), and maybe So (other, like `©`). – dan04 Apr 02 '20 at 19:05
  • 3
    @dan04 That's what the last paragraph of the answer mentions. Of course others can adapt this code to include/exclude categories depending on their own use case. – wim Apr 02 '20 at 19:13
18

The answer posted by wim is correct if you want to check if a character is a punctuation character.

If you really need a list of all punctuation characters as your question title suggests, you can use the following:

import sys
from unicodedata import category
# range() excludes its upper bound, so add 1 to cover sys.maxunicode itself.
punctuation_chars = [chr(i) for i in range(sys.maxunicode + 1)
                     if category(chr(i)).startswith("P")]
Selcuk
2

The answer by wim is great if you can change your code to use a function.

But if you have to use the in operator (for example, you're calling into library code), you can use duck typing:

import unicodedata

class DuckType:
    # Supports the `in` operator without storing any characters.
    def __contains__(self, s):
        return unicodedata.category(s).startswith("P")

punct = DuckType()
print("'" in punct, '"' in punct, "a" in punct)  # True True False
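One caveat with this approach: unicodedata.category() raises TypeError unless it is given exactly one character, so library code that tests longer strings or non-strings with in would fail. A hedged sketch of a more tolerant variant follows; the class name PunctuationLike is hypothetical:

```python
import unicodedata

class PunctuationLike:
    """Hypothetical variant that answers False instead of raising on odd input."""

    def __contains__(self, item):
        # category() only accepts a single character, so filter everything else out.
        if not isinstance(item, str) or len(item) != 1:
            return False
        return unicodedata.category(item).startswith("P")

punct = PunctuationLike()
print("’" in punct, "ab" in punct, 3 in punct)
```

Whether returning False for multi-character input is the right behaviour depends on what the calling library expects, so treat that as a design choice rather than a rule.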
Peter Mortensen
xkcdjerry
1

That seems like a pretty good job for a regular expression (regexp):

    import re
    text = re.sub(r"[^\w\s]", "", str(text), flags=re.UNICODE)

Here, the regexp matches everything except whitespace and word characters. The re.UNICODE flag makes \w and \s match across the full set of Unicode characters; in Python 3 this is already the default for str patterns, so the flag is redundant there.
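A small sketch of how this behaves on Python 3, where str is a Unicode string. The sample strings are illustrative, and the second pattern is just one possible way to also drop the underscore, which \w treats as a word character:

```python
import re

text = "Den som dræber - fanget"

# Non-ASCII letters such as "æ" count as word characters and are kept.
stripped = re.sub(r"[^\w\s]", "", text)
print(stripped)

# Caveat: "_" is a word character, so the pattern above leaves it in place.
kept_underscore = re.sub(r"[^\w\s]", "", "snake_case!")

# Adding "_" as an explicit alternative removes it too.
no_underscore = re.sub(r"[^\w\s]|_", "", "snake_case!")
```

Note that this pattern removes anything that is not a word character or whitespace, which is broader than the Unicode Punctuation categories.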

Peter Mortensen
Nicolas Martinez
  • doesn't work with many languages: `>>> text="Den som dræber - fanget" >>> re.sub(r"[^\w\s]", "", str(text), flags=re.UNICODE) 'Den som dr\xc3ber fanget'` – samuelbrody1249 Apr 02 '20 at 04:38
  • 1
    @samuelbrody1249 What do you mean it doesn't work? It does work in your example (the `\xc3` escape is a representation thing unrelated to the stripping of punctuation). – lenz Apr 02 '20 at 07:28
  • 1
    @lenz `\xc3` is not the correct Unicode encoding of `æ`; if you type `str(text)` you can confirm that it is `\xc3\xa6`. Actually `\xc3` does not seem to be a complete codepoint. – Federico Poloni Apr 02 '20 at 13:50
  • 6
    Oh I see. It seems you both are using Python 2, where `str` is a byte string. You should definitely switch to Python 3, because Unicode is a nightmare in Py2. For me, `str('æ')` shows as `'æ'`, and `ascii('æ')` shows as `'\xe6'`, which is the correct codepoint. `b'\xc3\xa6'` is the UTF-8 encoding of `'æ'`, but this isn't usually what you want to work with. – lenz Apr 02 '20 at 18:55
0

As other answers have pointed out, the way to do this is via Unicode properties/categories. The accepted answer accesses this information via the standard library unicodedata module, but depending on the context where you need this, it might be faster or more convenient to access this same property information using regular expressions.

However, the standard library re module does not provide extended Unicode support. For that, you need the regex module, available on PyPI (pip install regex):

>>> import regex as re
>>> re.match(r"\p{Punctuation}", "'")
<regex.Match object; span=(0, 1), match="'">
>>> re.match(r"\p{Punctuation}", "’")
<regex.Match object; span=(0, 1), match='’'>

A good overview of all the different kinds of Unicode properties you can search for using regular expressions is provided here. Apart from these extra regular expression features, which are documented on its PyPI homepage, regex deliberately provides the same API as re, so you're expected to use re's documentation to figure out how to use either of them.
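If installing a third-party package is not an option, a rough workaround is to build an explicit character class for the standard re module from the unicodedata categories. This is only a sketch under the assumption that the P* categories are the ones you want; the names punct_chars and PUNCT_RE are made up here:

```python
import re
import sys
from unicodedata import category

# Assumption: only the P* categories count as punctuation.
punct_chars = "".join(
    c for c in map(chr, range(sys.maxunicode + 1))
    if category(c).startswith("P")
)

# re.escape protects characters such as "]", "\\", "^" and "-" inside the class.
PUNCT_RE = re.compile("[" + re.escape(punct_chars) + "]")

result = PUNCT_RE.sub("", "“Quoted” text, isn’t it?")
print(result)
```

The resulting pattern is long, and it reflects the Unicode version bundled with your Python build, so it may differ slightly from what regex matches with \p{Punctuation}.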

dlukes