
I'm trying to remove all the punctuation from a dataframe, except the characters '<' and '>'

I tried:

def non_punct(df):

    df['C'] = df['C'].str.replace('[^\w\s]' | ~(<) | ~(>),' ')

    return df

Output:

    File "<ipython-input-292-ac8369672f62>", line 3
        df['Description'] = df['Description'].str.replace('[^\w\s]' | ~(<) | ~(>),' ')
                                                                ^
SyntaxError: invalid syntax

My dataframe:

       A          B                                    C
  French      house               Phone. <phone_numbers>
 English      house               email - <adresse_mail>
  French  apartment                      my name is Liam
  French      house                        Hello George!
 English  apartment   Ethan, my phone is <phone_numbers>

Good output:

       A          B                                    C
  French      house               Phone <phone_numbers>
 English      house               email  <adresse_mail>
  French  apartment                     my name is Liam
  French      house                        Hello George 
 English  apartment   Ethan my phone is <phone_numbers>
marin

3 Answers


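Here's a way to achieve your result with re.sub. Also, I think your regex is off: it should be [^\w\s<>]|_. The character class matches anything that is not a letter, digit, underscore, whitespace, < or >. The underscore has to be matched explicitly through the |_ alternative because \w includes it. Inside a character class only the leading ^ negates, so writing [^\w\s^<^>] would additionally protect literal ^ characters from replacement.

import re
re.sub(r'[^\w\s<>]|_', ' ', r'asdf.,:;/\><a b_?!"§$%&a')
>>> 'asdf      ><a b        a'

Just as a comparison:

# The question's pattern written as one string: nothing in the input matches it
re.sub(r'[^\w\s] | ~(<) | ~(>)', ' ', r'asdf.,:;/\><a b_?!"§$%&a')
>>> 'asdf.,:;/\\><a b_?!"§$%&a'

# Without the |_ alternative the underscore survives
re.sub(r'[^\w\s<>]', ' ', r'asdf.,:;/\><a b_?!"§$%&a')
>>> 'asdf      ><a b_       a'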

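EDIT: Your SyntaxError results from a misplaced quotation mark: the closing quote ends the string right after '[^\w\s]', so | ~(<) | ~(>) is parsed as Python code. Written as a single string it would be '[^\w\s] | ~(<) | ~(>)', which is valid Python but, as the comparison above shows, still not the regex you want.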

EDIT 2: As pointed out by @Brad Solomon, pd.Series.str.replace works perfectly well with regex, so using [^\w\s<>]|_ as the pattern in your statement should do the trick. I haven't tested that, though. @marin: if you give this a try, leave me feedback so that I can update the post if needed.
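A minimal sketch of that suggestion applied to a few rows from the question, assuming a pandas version in which str.replace accepts the regex keyword. The sample frame is rebuilt here only for illustration. The sketch uses [^\w\s<>] without the |_ alternative, because the placeholders in the question (<phone_numbers>, <adresse_mail>) contain underscores that the desired output keeps.

```python
import pandas as pd

# A few rows rebuilt from the question, for illustration only.
df = pd.DataFrame({
    'A': ['French', 'English', 'French'],
    'B': ['house', 'house', 'house'],
    'C': ['Phone. <phone_numbers>', 'email - <adresse_mail>', 'Hello George!'],
})


def non_punct(df):
    # Replace every character that is not a word character, whitespace,
    # '<' or '>' with a single space. regex=True is passed explicitly
    # because the default has differed between pandas versions.
    df['C'] = df['C'].str.replace(r'[^\w\s<>]', ' ', regex=True)
    return df


df = non_punct(df)
print(df['C'].tolist())
```

Appending |_ to the pattern would also replace underscores, turning <phone_numbers> into <phone numbers>.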

Lukas Thaler
    You're right! Thanks for pointing out my mistake. I should learn to read properly. I will edit my post so that it no longer states that. – Lukas Thaler Nov 07 '18 at 15:50

Here's a way with string.punctuation:

>>> import re
>>> import string

>>> import pandas as pd

>>> df = pd.DataFrame({
...     'a': ['abc', 'de.$&$*f(@)<', '<g>hij<k>'],
...     'b': [1234, 5678, 91011],
...     'c': ['me <me@gmail.com>', '123 West-End Lane', '<<xyz>>']
... })

>>> punc = string.punctuation.replace('<', '').replace('>', '')

>>> pat = re.compile(f'[{re.escape(punc)}]')  # re.escape keeps \, ], ^ and - literal inside the set
>>> df.replace(pat, '')
           a      b                 c
0        abc   1234   me <megmailcom>
1       def<   5678  123 WestEnd Lane
2  <g>hij<k>  91011           <<xyz>>

You should double-check that this constant covers the characters you want to remove:

String of ASCII characters which are considered punctuation characters in the C locale.

Values:

>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>> string.punctuation.replace('<', '').replace('>', '')
'!"#$%&\'()*+,-./:;=?@[\\]^_`{|}~'

Notes:

  • This solution uses an f-string (Python 3.6+)
  • It encloses those literal characters in a character set to match any of them
  • Note the difference between df.replace() and df[my_column_name].str.replace(): the former operates on the whole DataFrame, the latter on a single string column. At the time of writing, the signature for pd.DataFrame.replace() is DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad'), where to_replace can be a regex.
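To make the last note concrete, here is a hedged sketch that runs the same pattern through both methods. The small frame and the variable names whole_frame and one_column are made up for illustration. The pattern is passed as a plain string with regex=True, and re.escape is used so that characters such as the backslash and ] stay literal inside the character set.

```python
import re
import string

import pandas as pd

# Small illustrative frame; the column names are arbitrary.
df = pd.DataFrame({
    'a': ['abc', 'de.$&$*f(@)<'],
    'c': ['me <me@gmail.com>', '123 West-End Lane'],
})

# ASCII punctuation minus the two characters we want to keep.
punc = string.punctuation.replace('<', '').replace('>', '')
# Escape the characters so \, ], ^ and - are treated literally in the set.
pat = f'[{re.escape(punc)}]'

# DataFrame.replace is applied to every column; regex=True makes it
# substitute inside the strings instead of matching whole cell values.
whole_frame = df.replace(pat, '', regex=True)

# Series.str.replace handles one string column at a time.
one_column = df['c'].str.replace(pat, '', regex=True)

print(whole_frame)
print(one_column.tolist())
```

Both calls give the same result for column c; which one to use mostly depends on whether one column or the whole frame needs cleaning.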
Brad Solomon

In a single line (aside from the import), in Python 2 it would be:

import string
df['C'] = df['C'].str.translate(None, string.translate(string.punctuation, None, '<>'))
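string.translate and the two-argument form of translate exist only in Python 2. A rough Python 3 equivalent, offered as a sketch and not as this answer's original code, builds a deletion table with str.maketrans. The two sample rows and the names to_strip and table are illustrative assumptions.

```python
import string

import pandas as pd

# ASCII punctuation minus the two characters we want to keep.
to_strip = string.punctuation.replace('<', '').replace('>', '')

# With three arguments, str.maketrans builds a table that deletes
# every character listed in the third argument.
table = str.maketrans('', '', to_strip)

# Two rows taken from the question, for illustration only.
df = pd.DataFrame({'C': ['Phone. <phone_numbers>', 'Hello George!']})
df['C'] = df['C'].str.translate(table)
print(df['C'].tolist())
```

Unlike the regex approaches, this deletes the characters instead of replacing them with a space, and it also removes underscores. Dropping '_' from to_strip keeps placeholders such as <phone_numbers> intact.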
zipa