This is the function I'm using to remove punctuation from a column in pandas.

def remove_punctuation(text):
    return re.sub(r'[^\w\s]','',text)

This is how I'm applying it.

review_without_punctuation = products['review'].apply(remove_punctuation)

Here `products` is a pandas DataFrame.

This is the error message that I get.

TypeError                                 Traceback (most recent call last)
<ipython-input-19-196c188dfb67> in <module>()
----> 1 review_without_punctuation = products['review'].apply(remove_punctuation)

/Users/username/Dropbox/workspace/private/pydev/ml/classification/.env/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2292             else:
   2293                 values = self.asobject
-> 2294                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2295 
   2296         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()

<ipython-input-18-0950dc65d8b8> in remove_punctuation(text)
      1 def remove_punctuation(text):
----> 2     return re.sub(r'[^\w\s]','',text)

/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py in sub(pattern, repl, string, count, flags)
    189     a callable, it's passed the match object and must return
    190     a replacement string to be used."""
--> 191     return _compile(pattern, flags).sub(repl, string, count)
    192 
    193 def subn(pattern, repl, string, count=0, flags=0):

TypeError: expected string or bytes-like object

What am I doing wrong?

Melissa Stewart

2 Answers

You should always try to avoid running pure Python code via apply() in Pandas. It's slow. Instead, use the `.str` accessor, which is available on every Pandas series of strings:

In [9]: s = pd.Series(['hello', 'a,b,c', 'hmm...'])
In [10]: s.str.replace(r'[^\w\s]', '')
Out[10]: 
0    hello
1      abc
2      hmm
dtype: object
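
Two details are worth checking when running this today. Since pandas 2.0, `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed. And the `.str` methods let missing values pass through instead of raising, which matters if the `review` column contains NaN (an assumption, since the question does not show the data). A minimal sketch with made-up sample values:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN stands in for a missing review.
s = pd.Series(['hello', 'a,b,c', 'hmm...', np.nan])

# regex=True asks pandas to treat the pattern as a regular expression.
# Missing values are left as they are rather than raising a TypeError.
cleaned = s.str.replace(r'[^\w\s]', '', regex=True)

# Optionally turn missing reviews into empty strings before cleaning.
cleaned_filled = s.fillna('').str.replace(r'[^\w\s]', '', regex=True)
```

Whether to keep missing values or fill them with empty strings depends on what the later processing expects.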
John Zwinck
  • Would you please elaborate on your `avoid running pure Python code via apply() in Pandas`? I always thought it was the right way, since it's vectorized and therefore the fastest and most straightforward. – Sergey Bushmanov Mar 19 '17 at 03:06
  • @SergeyBushmanov: To me, "vectorized" means "There is no loop executing Python code on each row individually." All `apply()` does is run Python code on each row individually. And it's slow. It's just a pretty way of doing something you should not do. If you pass a `ufunc` (a special kind of function which is not a regular Python function) to `apply()`, then it is truly "vectorized" (i.e. fast). – John Zwinck Mar 19 '17 at 03:12
  • Can you explain what `r'[^\w\s]'` does, or what this regex is there for? – asn May 13 '20 at 15:42
  • @KPMG: `\w` matches word characters (letters, digits and the underscore) and `\s` matches whitespace. The `[^` in front of them means "anything other than these." So we are replacing anything other than word characters and whitespace with the empty string, i.e. we are removing punctuation and other unwanted characters. – John Zwinck May 15 '20 at 12:22
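
To make the effect of the pattern concrete, here is a small sketch using only the standard `re` module. The sample strings and variable names are made up for illustration:

```python
import re

# Matches any single character that is neither a word character nor whitespace.
pattern = r'[^\w\s]'

# Punctuation is removed; letters, digits and spaces are kept.
cleaned_sentence = re.sub(pattern, '', 'Hello, world! 42 times.')
print(cleaned_sentence)    # Hello world 42 times

# The underscore counts as a word character, so it survives; the hyphen does not.
cleaned_identifier = re.sub(pattern, '', 'snake_case-name')
print(cleaned_identifier)  # snake_casename
```

If underscores should be removed as well, the character class would need to be adjusted accordingly.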

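Calling `apply` with your function is fine in itself; the error means that `re.sub` received a value that is not a string, most likely a missing value (NaN) in the column.

On a series that contains only strings, the same approach works: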

import re
s = pd.Series(['hello', 'a,b,c', 'hmm...'])
s.apply(lambda x: re.sub(r'[^\w\s]', '',x))
0    hello
1      abc
2      hmm
dtype: object

(hat tip to @John Zwinck for regex)
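
The traceback in the question ends inside `re.sub` with "expected string or bytes-like object", which is what happens when a non-string value reaches it. Assuming the `review` column contains missing values (an assumption; the data is not shown), a guarded version of the function could look like the sketch below. The sample series is hypothetical.

```python
import re

import numpy as np
import pandas as pd


def remove_punctuation(text):
    # Return non-string values (for example NaN) unchanged instead of
    # handing them to re.sub, which only accepts str or bytes.
    if not isinstance(text, str):
        return text
    return re.sub(r'[^\w\s]', '', text)


# Hypothetical data with one missing review.
reviews = pd.Series(['Great!', np.nan, 'so-so...'])

# Passing the function directly behaves the same as wrapping it in a lambda.
result = reviews.apply(remove_punctuation)
```

Here the missing review stays missing and the other two become `Great` and `soso`.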

Comparing this to another solution:

%timeit s.apply(lambda x: re.sub(r'[^\w\s]', '',x))
1000 loops, best of 3: 275 µs per loop
%timeit s.str.replace(r'[^\w\s]', '')
1000 loops, best of 3: 310 µs per loop
Sergey Bushmanov
  • The `str.replace()` method is faster than `apply()` when the series is long. But not by too much, because the function being applied is trivial. – John Zwinck Mar 19 '17 at 03:15
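
A three-element series is too small to say much about speed. One way to repeat the comparison on a longer series is sketched below, using the standard `timeit` module rather than the IPython magic. The series length and helper names are arbitrary choices for illustration, and the timings will vary by machine and pandas version, so none are quoted here.

```python
import re
import timeit

import pandas as pd

# Hypothetical longer series: the three sample strings repeated many times.
s_long = pd.Series(['hello', 'a,b,c', 'hmm...'] * 50_000)


def with_apply():
    # Runs re.sub in Python once per element.
    return s_long.apply(lambda x: re.sub(r'[^\w\s]', '', x))


def with_str_replace():
    # Uses the .str accessor; regex=True is required on pandas 2.0 and later.
    return s_long.str.replace(r'[^\w\s]', '', regex=True)


# Both approaches should produce the same cleaned strings.
assert with_apply().tolist() == with_str_replace().tolist()

# Total time in seconds for three runs of each approach.
print('apply:      ', timeit.timeit(with_apply, number=3))
print('str.replace:', timeit.timeit(with_str_replace, number=3))
```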