Remove all the punctuation from a dataframe, except some characters

Question

I'm trying to remove all the punctuation from a dataframe column, except for the characters '<' and '>'.

I tried:

def non_punct(df):

    df['C'] = df['C'].str.replace('[^\w\s]' | ~(<) | ~(>),' ')

    return df

Output:

    File "<ipython-input-292-ac8369672f62>", line 3
        df['Description'] = df['Description'].str.replace('[^\w\s]' | ~(<) | ~(>),' ')
                                                                ^
SyntaxError: invalid syntax

My dataframe:

       A          B                                    C
  French      house               Phone. <phone_numbers>
 English      house               email - <adresse_mail>
  French  apartment                      my name is Liam
  French      house                        Hello George!
 English  apartment   Ethan, my phone is <phone_numbers>

Good output:

       A          B                                    C
  French      house               Phone <phone_numbers>
 English      house               email  <adresse_mail>
  French  apartment                     my name is Liam
  French      house                        Hello George 
 English  apartment   Ethan my phone is <phone_numbers>

Could you please add desired output, because it looks like you are just missing quotes in your regex? — zipa, Nov 07 '18 at 14:29

Lukas Thaler · Answer 1 · 2018-11-07T15:54:09.647

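Here's a way to achieve your result with re.sub. Also, I think your regex is off: it should be [^\w\s^<^>]|_. This matches everything that is not a letter, digit, whitespace, < or >. You have to match the underscore explicitly because \w includes it, so the negated class alone would leave it in place. Note that only the first ^ negates the class; the later ones are literal carets, so ^ itself is also left untouched, and [^\w\s<>]|_ is the tighter form.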

>>> import re
>>> re.sub(r'[^\w\s^<^>]|_', ' ', r'asdf.,:;/\><a b_?!"§$%&a')
'asdf      ><a b        a'

Just as a comparison, the pattern from the question (unchanged input, because nothing matches) and the class without the explicit underscore:

>>> re.sub(r'[^\w\s] | ~(<) | ~(>)', ' ', r'asdf.,:;/\><a b_?!"§$%&a')
'asdf.,:;/\\><a b_?!"§$%&a'

>>> re.sub(r'[^\w\s^<^>]', ' ', r'asdf.,:;/\><a b_?!"§$%&a')
'asdf      ><a b_       a'

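EDIT: Your SyntaxError results from a misplaced quotation mark: the whole pattern has to be inside the string, i.e. '[^\w\s] | ~(<) | ~(>)' and not '[^\w\s]' | ~(<) | ~(>). That only makes the code parse, though; as the comparison above shows, that pattern still does not remove the punctuation.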

EDIT 2: As pointed out by @Brad Solomon, pd.Series.str.replace handles regex perfectly well, so using [^\w\s^<^>]|_ as the pattern in your statement should do the trick. I haven't tested that, though. @marin: If you give this a try, leave me feedback so that I can update the post if needed.
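
As a rough sketch of how such a pattern behaves on the strings from the question's column C, here is a version that uses only re. The helper name strip_punct is hypothetical, and keeping the underscore (so that <phone_numbers> survives, as in the question's desired output) is an assumption rather than part of the answer above.

```python
import re

# Assumption: keep word characters (including "_"), whitespace, "<" and ">".
# Append "|_" to the pattern if underscores should be replaced as well.
PATTERN = re.compile(r'[^\w\s<>]')


def strip_punct(text):
    """Replace every character matched by PATTERN with a single space."""
    return PATTERN.sub(' ', text)


samples = [
    'Phone. <phone_numbers>',
    'email - <adresse_mail>',
    'Hello George!',
    'Ethan, my phone is <phone_numbers>',
]
for s in samples:
    # repr() makes the inserted spaces visible
    print(repr(strip_punct(s)))
```

On a dataframe the same pattern could presumably be passed as df['C'].str.replace(r'[^\w\s<>]', ' ', regex=True); pandas 2.0 and later need regex=True because the default there is literal matching.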

You're right! Thanks for pointing out my mistake. I Should learn to read properly. Will edit my post to no longer state that — Lukas Thaler, Nov 07 '18 at 15:50

Brad Solomon · Accepted Answer · 2018-11-07T14:39:24.720

Here's a way with string.punctuation:

>>> import re
>>> import string

>>> import pandas as pd

>>> df = pd.DataFrame({
...     'a': ['abc', 'de.$&$*f(@)<', '<g>hij<k>'],
...     'b': [1234, 5678, 91011],
...     'c': ['me <me@gmail.com>', '123 West-End Lane', '<<xyz>>']
... })

>>> punc = string.punctuation.replace('<', '').replace('>', '')

>>> pat = re.compile(f'[{punc}]')
>>> df.replace(pat, '')
           a      b                 c
0        abc   1234   me <megmailcom>
1       def<   5678  123 WestEnd Lane
2  <g>hij<k>  91011           <<xyz>>

You should double-check that this constant is inclusive of what you want:

String of ASCII characters which are considered punctuation characters in the C locale.

Values:

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.punctuation.replace('<', '').replace('>', '')
'!"#$%&\'()*+,-./:;=?@[\\]^_`{|}~'

Notes:

This solution uses an f-string (Python 3.6+)
It encloses those literal characters in a character set to match any of them
Note the difference between df.replace() and df[my_column_name].str.replace(). The signature for pd.DataFrame.replace() is DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad'), where to_replace can be a regex.
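
Because string.punctuation contains characters that are special inside a character set (\, ], ^ and -), one defensive variant is to pass it through re.escape before building the pattern. This is a minimal sketch on plain strings, not the code from the answer; the names keep and remove_punct are hypothetical.

```python
import re
import string

keep = '<>'  # characters that must survive (assumption: same as in the question)
punc = ''.join(ch for ch in string.punctuation if ch not in keep)

# re.escape backslash-escapes regex metacharacters, so the set does not
# depend on where "]", "^", "-" or "\" happen to sit in the string.
pat = re.compile('[' + re.escape(punc) + ']')


def remove_punct(text):
    """Delete every punctuation character except those listed in keep."""
    return pat.sub('', text)


print(remove_punct('me <me@gmail.com>'))   # me <megmailcom>
print(remove_punct('123 West-End Lane'))   # 123 WestEnd Lane
```

The compiled pat can be handed to df.replace in the same way as the pattern built with the f-string.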

score 1 · Answer 3 · answered Nov 07 '18 at 14:42

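In a single line (aside from the import) it would be, on Python 3:

import string
df['C'] = df['C'].str.translate(str.maketrans('', '', string.punctuation.replace('<', '').replace('>', '')))

Note that this deletes the punctuation (including the underscore) rather than replacing it with a space.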
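
A minimal Python 3 sketch of the translate approach on plain strings, using str.maketrans (the module-level string.translate function exists only on Python 2). The variable names are hypothetical, and the space-mapping variant is an assumption about what the question wants.

```python
import string

# Punctuation to delete: everything in string.punctuation except "<" and ">".
to_delete = string.punctuation.replace('<', '').replace('>', '')

# Three-argument maketrans: the third argument lists characters mapped to None,
# i.e. deleted by str.translate.
delete_table = str.maketrans('', '', to_delete)

# Variant (assumption): map each punctuation character to a space instead.
space_table = str.maketrans(to_delete, ' ' * len(to_delete))

text = 'Ethan, my phone is <phone_numbers>'
print(text.translate(delete_table))  # Ethan my phone is <phonenumbers>
print(text.translate(space_table))   # Ethan  my phone is <phone numbers>
```

Either table can be passed to df['C'].str.translate. Since the underscore is part of string.punctuation, strip it from to_delete as well if placeholders such as <phone_numbers> have to stay intact.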

zipa
