Remove punctuation for each row in a pandas DataFrame

Question

I am new to Python, so this may be a very basic question. I am trying to use a lambda to remove punctuation from each row in a pandas DataFrame. I used the following, but received an error. I am trying to avoid having to convert the df into a list, append the cleaned results to a new list, and then convert it back to a df.

Any suggestions would be appreciated!

import string

df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)

mechanical_meat · Accepted Answer · 2015-10-09T22:59:03.420

13

You need to iterate over the string in the dataframe, not over string.punctuation. You also need to build the string back up using .join().

df['cleaned'] = df['old'].apply(lambda x:''.join([i for i in x 
                                                  if i not in string.punctuation]))
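
To see what that lambda does for a single row, the same expression can be tried on a plain string. This is only a minimal sketch; the sample text is made up for illustration and is not from the question:

```python
import string

# Hypothetical sample value standing in for one cell of df['old'].
sample = "Hello, world! (it's 5:30...)"

# Walk the characters of the string, keep those that are not punctuation,
# then glue the survivors back together with ''.join().
cleaned = ''.join([ch for ch in sample if ch not in string.punctuation])

print(cleaned)
```

Iterating over the value itself means each character is tested once, which is what the original attempt (looping over string.punctuation inside apply) did not do.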

When lambda expressions get long like that it can be more readable to write out the function definition separately, e.g. (thanks to @AndyHayden for the optimization tips):

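PUNCTUATION = frozenset(string.punctuation)  # build the lookup set once, not once per character

def remove_punctuation(s):
    # keep only the characters that are not punctuation
    return ''.join([c for c in s if c not in PUNCTUATION])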

df['cleaned'] = df['old'].apply(remove_punctuation)

edited Oct 09 '15 at 22:59

answered Oct 09 '15 at 22:13

You're very welcome! – mechanical_meat Oct 09 '15 at 22:21
You can accept this answer if it works for you. – Mukesh Thawani Oct 09 '15 at 22:43
One improvement here is to use set(string.punctuation) rather than string.punctuation in remove_punctuation. – Andy Hayden Oct 09 '15 at 22:49
Thanks, Andy. I'll add that in. – mechanical_meat Oct 09 '15 at 22:50
square brackets/list comprehension around the join give you another boost fwiw :) – Andy Hayden Oct 09 '15 at 22:54
Really! Now I've learned something. – mechanical_meat Oct 09 '15 at 23:01
@bernie There's a classic answer somewhere on SO about it, can't find it. IIRC ~50% faster, as Python needs to build a list anyway to calculate the length of the joined string. – Andy Hayden Oct 09 '15 at 23:14
@AndyHayden Oh I see. Thank you for the tips! – mechanical_meat Oct 09 '15 at 23:16
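
The list-versus-generator point from the comments above can be checked with a rough timing sketch. The sample text, function names, and repeat count below are assumptions chosen for illustration; the actual ratio will vary with the Python version and the input:

```python
import string
import timeit

PUNCTUATION = frozenset(string.punctuation)
text = "a..b, c<=d; e|}f! " * 1000  # made-up sample text

def with_list(s):
    # list comprehension inside join()
    return ''.join([c for c in s if c not in PUNCTUATION])

def with_generator(s):
    # generator expression inside join()
    return ''.join(c for c in s if c not in PUNCTUATION)

# Both variants must produce the same result.
assert with_list(text) == with_generator(text)

list_time = timeit.timeit(lambda: with_list(text), number=50)
gen_time = timeit.timeit(lambda: with_generator(text), number=50)
print(f"list comprehension: {list_time:.4f}s, generator: {gen_time:.4f}s")
```

Treat the printed numbers as a local measurement only, not as confirmation of the ~50% figure quoted from memory.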

score 4 · Answer 2 · answered Oct 09 '15 at 22:42

Using a regex will most likely be faster here:

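In [9]: import re, string

In [10]: import pandas as pd

In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])  # perhaps this is available in the re/regex library?

In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])

In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)  # regex=True is needed on pandas >= 2.0, where the default is a literal match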
Out[13]:
0    ab
1    cd
2    ef
dtype: object

Andy Hayden

this should be the accepted answer... – clg4 Jul 08 '16 at 23:52
Similarly: `s.str.replace('[{}]'.format(string.punctuation), '')` – David C Aug 09 '17 at 20:47
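
One caveat with the character-class form: string.punctuation contains a backslash followed by "]", which the regex engine reads as an escaped bracket, so the backslash itself does not end up in the class. A minimal sketch of the difference, using only the standard library; the sample string and variable names are made up for illustration:

```python
import re
import string

# Naive character class: the "\]" inside string.punctuation is read as an
# escaped "]", so the backslash itself never becomes part of the class.
naive = re.compile('[{}]'.format(string.punctuation))

# Escaping the punctuation first keeps every character literal.
escaped = re.compile('[{}]'.format(re.escape(string.punctuation)))

sample = r"a\b]c^d-e"  # made-up sample containing the tricky characters

print(naive.sub('', sample))    # backslash survives
print(escaped.sub('', sample))  # all punctuation removed
```

On a Series, the escaped pattern string could presumably be used the same way as in the answer, e.g. `s.str.replace(escaped.pattern, '', regex=True)`.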
