6

I am new to Python, so this may be a very basic question. I am trying to use a lambda to remove punctuation from each row of a pandas dataframe column. I used the following, but received an error. I am trying to avoid having to convert the df into a list, append the cleaned results to a new list, and then convert that back to a df.

Any suggestions would be appreciated!

import string

df['cleaned'] = df['old'].apply(lambda x: x.replace(c,'') for c in string.punctuation)
cs95
RJL

2 Answers

13

You need to iterate over the characters of each string in the dataframe, not over string.punctuation, and keep only the characters that are not punctuation. You also need to build the string back up using ''.join().

df['cleaned'] = df['old'].apply(lambda x:''.join([i for i in x 
                                                  if i not in string.punctuation]))

When lambda expressions get long like that it can be more readable to write out the function definition separately, e.g. (thanks to @AndyHayden for the optimization tips):

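import string

PUNCTUATION = frozenset(string.punctuation)  # build the lookup set once, not per character

def remove_punctuation(s):
    # Keep only the characters that are not punctuation.
    return ''.join([i for i in s if i not in PUNCTUATION])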

df['cleaned'] = df['old'].apply(remove_punctuation)
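
As a minimal end-to-end sketch of the approach above: the sample rows below are made up for illustration and are not from the question, and the punctuation set is built once at module level on the assumption that this is the optimization being referred to.

```python
import string

import pandas as pd

# Build the lookup set once so each membership test is a cheap set lookup.
PUNCTUATION = frozenset(string.punctuation)


def remove_punctuation(s):
    """Return s with every ASCII punctuation character dropped."""
    return ''.join(ch for ch in s if ch not in PUNCTUATION)


# Hypothetical sample data, not from the original question.
df = pd.DataFrame({'old': ['Hello, world!', 'a..b', 'no punctuation']})
df['cleaned'] = df['old'].apply(remove_punctuation)
print(df['cleaned'].tolist())  # ['Hello world', 'ab', 'no punctuation']
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes would pass through unchanged.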
mechanical_meat
4

Using a regex will most likely be faster here:

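In [10]: import re, string; import pandas as pd

In [11]: RE_PUNCTUATION = '|'.join([re.escape(x) for x in string.punctuation])  # perhaps this is available in the re/regex library?

In [12]: s = pd.Series(["a..b", "c<=d", "e|}f"])

In [13]: s.str.replace(RE_PUNCTUATION, "", regex=True)  # regex=True is required on pandas 2.0+, where the default is a literal match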
Out[13]:
0    ab
1    cd
2    ef
dtype: object
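
A possible variation, offered as a sketch rather than as part of the answer above: the same pattern can be written as a single character class instead of a long alternation. The name PUNCT_CLASS is hypothetical, and regex=True is passed explicitly so the pattern is not treated as a literal string.

```python
import re
import string

import pandas as pd

# Character class matching any single ASCII punctuation character.
# re.escape protects the characters that are special inside [...],
# such as ], \, ^ and -.
PUNCT_CLASS = '[' + re.escape(string.punctuation) + ']'

s = pd.Series(["a..b", "c<=d", "e|}f"])
cleaned = s.str.replace(PUNCT_CLASS, "", regex=True)
print(cleaned.tolist())  # ['ab', 'cd', 'ef']
```

Whether this is faster than the alternation or the .apply() version depends on the data, so it is worth timing on a representative sample.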
Andy Hayden