
I am trying to efficiently strip punctuation from a unicode string. With a regular string, `mystring.translate(None, string.punctuation)` is clearly the fastest approach. However, this code breaks on a unicode string in Python 2.7. As the comments to this answer explain, the `translate` method can still be used, but it must be implemented with a dictionary. When I use this implementation, though, I find that `translate`'s performance is dramatically reduced. Here is my timing code (copied primarily from this answer):

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."


exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(None, string.punctuation)

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode)      :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode)     :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode)   :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs horribly:

sets      : 38.323941946
regex     : 6.7729549408
translate : 1.27428412437
replace   : 5.54967689514

sets (unicode)      : 43.6268708706
regex (unicode)     : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode)   : 17.4450061321

My question is whether there is a faster way to implement translate for unicode (or any other method) that would outperform regex.

Michael
  • taking a quick look at the C source code the `translate` builtin between `stringobject.c` and `unicodeobject.c` they are indeed implemented very differently. – qwwqwwq Dec 11 '13 at 20:50
  • You could probably speed it up, yes. There's a bunch of function calls that are mainly for clarity; these could be inlined. The primary issue for implementation is that unicode chars are more intensive to parse and that the number of possible replacements is much larger (your `tbl` contains 585 characters) which necessitates the map strategy used in `unicodeobject`. Is the regex method too slow? – beerbajay Dec 11 '13 at 23:39
  • I was not even thinking about the C implementation, I just meant is there existing python code that can outperform the regex method. The regex method will do fine, it's what I've implemented for now, but it's five times slower and I've got a lot of text so I figured I'd ask. – Michael Dec 12 '13 at 00:02
  • What is the total time consumption if you convert your unicode string to a regular string first and then use translate on the converted string? – Dyrborg Dec 12 '13 at 00:08
  • `def test_trans_unicode_convert(su): return str(su).translate(None, string.punctuation)` gets an error: `UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-39: ordinal not in range(128)` For whatever reason Wired displays apostrophes different than single quotes. In practice this is what will happen in part of my dataset if I try this approach. – Michael Dec 12 '13 at 00:25

1 Answer


The current test script is flawed, because it does not compare like with like.

For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ascii, or all unicode).

When that is done, the regex and replace methods fare much worse with the full set of unicode punctuation characters.

For full unicode, it looks like the "set" method is the best. However, if you only want to remove the ascii punctuation characters from unicode strings, it may be best to encode, translate, and decode (depending on the length of the input string).

The "replace" method can also be substantially improved by doing a containment test before attempting replacements (depending on the precise make-up of the string).
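In isolation, the encode/translate/decode round-trip for the ascii-only case looks like this (the function name is mine; the `.encode('ascii')` on the punctuation set is a no-op on Python 2 but keeps the sketch valid on Python 3, where `bytes.translate` accepts the same two-argument deletion form). UTF-8 is safe here because continuation bytes never collide with ASCII punctuation values:

```python
import string

def strip_ascii_punct(su):
    # Round-trip: unicode -> UTF-8 bytes -> fast two-argument
    # translate (delete ASCII punctuation) -> back to unicode.
    b = su.encode('utf-8')
    b = b.translate(None, string.punctuation.encode('ascii'))
    return b.decode('utf-8')

# Non-ascii punctuation (the curly quote) survives; the ascii dots do not.
print(strip_ascii_punct(u"it\u2019s a test..."))  # it’s a test
```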

Here are some sample results from a re-hash of the test script:

$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062

And here's the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s  as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')
ekhumoro
  • +1 @ekhumoro, I can't sleep :-) I was thinking about your comments and you are right: the question is incorrectly framed, and your answer is the right one (IMHO), so I removed mine. – Roberto Mar 24 '14 at 00:38
  • @RobertoSánchez. Thanks! And hope it doesn't give you nightmares ;-) – ekhumoro Mar 24 '14 at 00:41
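For anyone reading this on Python 3, where `str` is unicode throughout: the two-argument form of `str.translate` no longer exists, and the mapping form used above is the only option. A minimal sketch (function name is mine):

```python
import sys
import unicodedata

# Map every Unicode code point in a punctuation category ('P*') to None,
# which translate() treats as deletion.
tbl = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def strip_punct(s):
    return s.translate(tbl)

print(strip_punct(u"Can\u2019t all be Greedos."))  # Cant all be Greedos
```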