Trying to remove punctuation from a column in Pandas

Question

This is the function I'm using to remove punctuation from a column in pandas.

import re

def remove_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)

This is how I'm applying it.

review_without_punctuation = products['review'].apply(remove_punctuation)

Here products is a pandas DataFrame.

This is the error message that I get.

TypeError                                 Traceback (most recent call last)
<ipython-input-19-196c188dfb67> in <module>()
----> 1 review_without_punctuation = products['review'].apply(remove_punctuation)

/Users/username/Dropbox/workspace/private/pydev/ml/classification/.env/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2292             else:
   2293                 values = self.asobject
-> 2294                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2295 
   2296         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()

<ipython-input-18-0950dc65d8b8> in remove_punctuation(text)
      1 def remove_punctuation(text):
----> 2     return re.sub(r'[^\w\s]','',text)

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

What am I doing wrong?

Can you check if there is any 'nan' or non-string value in any row of the column `review`? — Ali, Mar 19 '17 at 01:51
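The check suggested in this comment can be sketched as follows. The contents of `products` here are a hypothetical stand-in, since the question does not show the real data; the point is only that one missing or numeric entry in `review` is enough to make `re.sub` raise the `TypeError` above.

```python
import pandas as pd

# Hypothetical stand-in for the question's `products` DataFrame
products = pd.DataFrame({'review': ['Great product!', None, 'So-so...', 3.5]})

# Boolean mask: True where the value is not a Python str
non_string = ~products['review'].map(lambda v: isinstance(v, str))

# Show the offending rows, then a count of each Python type in the column
print(products.loc[non_string, 'review'])
print(products['review'].map(type).value_counts())
```

If the mask selects any rows, those are the values that `remove_punctuation` cannot handle.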

score 1 · Accepted Answer · answered Mar 19 '17 at 01:50

You should always try to avoid running pure Python code via apply() in Pandas. It's slow. Instead, use the `.str` accessor, which is available on every Pandas Series of strings and passes missing values through as NaN instead of raising:

In [9]: s = pd.Series(['hello', 'a,b,c', 'hmm...'])
In [10]: s.str.replace(r'[^\w\s]', '', regex=True)
Out[10]: 
0    hello
1      abc
2      hmm
dtype: object
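A minimal sketch of how this behaves when the column has a missing entry, which is the likely cause of the error in the question. The sample series is made up for illustration. `regex=True` is spelled out because pandas 2.0 changed the default of `str.replace` to literal (non-regex) matching.

```python
import pandas as pd

# Made-up sample with one missing value
s = pd.Series(['hello', 'a,b,c', None, 'hmm...'])

# Missing entries stay missing; no TypeError is raised
cleaned = s.str.replace(r'[^\w\s]', '', regex=True)
print(cleaned)

# Optional: turn missing entries into empty strings afterwards
filled = cleaned.fillna('')
print(filled)
```

Whether to keep the missing values or fill them with `''` depends on what the downstream code expects.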

John Zwinck

Would you please elaborate on your `avoid running pure Python code via apply() in Pandas`? I always thought it was the right way, since it's vectorized (and therefore fastest) and the most straightforward. – Sergey Bushmanov Mar 19 '17 at 03:06

@SergeyBushmanov: To me, "vectorized" means "There is no loop executing Python code on each row individually." All `apply()` does is run Python code on each row individually. And it's slow. It's just a pretty way of doing something you should not do. If you pass a `ufunc` (a special kind of function which is not a regular Python function) to `apply()`, then it is truly "vectorized" (i.e. fast). – John Zwinck Mar 19 '17 at 03:12
Can you explain what the part `r'[^\w\s]'` does, or what this regex is there for? – asn May 13 '20 at 15:42

@KPMG: `\w` matches word characters (letters, digits, and the underscore) and `\s` matches whitespace. The `[^` in front of them means "anything other than these." So we are replacing every character that is neither a word character nor whitespace with the empty string, i.e. we are removing punctuation and other unwanted characters. – John Zwinck May 15 '20 at 12:22
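To make the regex concrete, here is a small sketch using only the standard `re` module. The sample string and the second pattern are made up for illustration. Note that the underscore counts as a word character, so the original pattern leaves it in place.

```python
import re

pattern = r'[^\w\s]'  # any character that is neither a word character nor whitespace

sample = "Hello, world! It's user_name #42..."

# Punctuation is dropped; letters, digits, underscores and spaces survive
kept_underscore = re.sub(pattern, '', sample)
print(kept_underscore)  # Hello world Its user_name 42

# Illustrative variant: also strip underscores by adding them as an alternative
no_underscore = re.sub(r'[^\w\s]|_', '', sample)
print(no_underscore)    # Hello world Its username 42
```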

score 0 · Answer 2 · answered Mar 19 '17 at 03:05

It does not work because your apply is applied wrongly.

The correct way to do it is:

import re
import pandas as pd
s = pd.Series(['hello', 'a,b,c', 'hmm...'])
s.apply(lambda x: re.sub(r'[^\w\s]', '', x))
0    hello
1      abc
2      hmm
dtype: object

(hat tip to @John Zwinck for regex)
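One caveat: if the column contains missing or non-string values, the `apply` call above raises the same `TypeError` as in the question, because `re.sub` only accepts strings. A minimal sketch of a guard, assuming non-strings should simply be left as they are (the sample series is made up):

```python
import re
import pandas as pd

def remove_punctuation(text):
    # Leave non-strings (None, NaN, numbers) untouched instead of raising TypeError
    if not isinstance(text, str):
        return text
    return re.sub(r'[^\w\s]', '', text)

# Made-up sample with one missing value
s = pd.Series(['hello', 'a,b,c', None, 'hmm...'])
cleaned = s.apply(remove_punctuation)
print(cleaned)
```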

Comparing this to another solution:

%timeit s.apply(lambda x: re.sub(r'[^\w\s]', '',x))
%timeit s.str.replace(r'[^\w\s]', '', regex=True)
1000 loops, best of 3: 275 µs per loop
1000 loops, best of 3: 310 µs per loop

Sergey Bushmanov

The `str.replace()` method is faster than `apply()` when the series is long. But not by too much, because the function being applied is trivial. – John Zwinck Mar 19 '17 at 03:15
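For anyone who wants to check this on a longer series, here is a rough timing sketch outside of IPython. The series length and repeat count are arbitrary choices, and the numbers will vary by machine and pandas version, so none are quoted here.

```python
import re
import timeit
import pandas as pd

# Arbitrary test data: the three sample strings repeated to make a longer series
s = pd.Series(['hello', 'a,b,c', 'hmm...'] * 10_000)

def via_apply():
    return s.apply(lambda x: re.sub(r'[^\w\s]', '', x))

def via_str_replace():
    return s.str.replace(r'[^\w\s]', '', regex=True)

# Sanity check: both approaches produce the same cleaned strings
same = via_apply().tolist() == via_str_replace().tolist()
print("results match:", same)

runs = 10
for fn in (via_apply, via_str_replace):
    seconds = timeit.timeit(fn, number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.2f} ms per call")
```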
