Removing all non-alphanumeric chars from a Pandas DataFrame

Asked Sep 04 '18 at 21:58

Active Sep 04 '18 at 22:02

Viewed 3,854 times

I have a large DataFrame in Pandas. Its `col` column contains text (a sequence of words). For each value in this column, I would like every word to be stripped of all non-alphanumeric characters. Here are some examples of things I want dropped:

, . ' " { } [ ] ( ) ! @ # $ % & * - +
all numbers/digits
symbols like trademark, registered trademark, copyright, etc
all non-English characters

Most importantly, I want the results to be put back where they were. For example, if one field in `col` has the value `I'll be there @ 5, no $hit!`, the output should be `Ill be there no hit`, and it should be set as the new value for that row and column (making a copy of the DataFrame is okay). If dropping the unwanted characters leaves nothing, the value should be an empty string.

What is the most efficient way of doing this in Pandas? (The DataFrame has about 5 million rows, and the values in `col` have an average length of 50.)

Asked by Tapal Goosal

`df['col'] = df.col.str.replace(r'[^a-zA-Z ]\s?',r'',regex=True)` – Paolo Sep 04 '18 at 22:20
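To make the comment's one-liner easier to try, here is a minimal sketch on a tiny frame. The sample rows other than the one from the question are made up for illustration, and the second variant (`alt`) is not from the thread: it is one possible alternative that removes every non-letter and then collapses runs of whitespace, assuming that words separated by a dropped symbol should stay separate.

```python
import pandas as pd

# Hypothetical sample data; only the first value comes from the question.
df = pd.DataFrame(
    {"col": ["I'll be there @ 5, no $hit!", "12345 !!!", "café ™ naïve"]}
)

# Pattern from the comment: remove each character that is not an ASCII
# letter or a space, plus one whitespace character right after it (if any).
cleaned = df["col"].str.replace(r"[^a-zA-Z ]\s?", "", regex=True)

# Alternative sketch (an assumption, not from the thread): drop everything
# that is not an ASCII letter or whitespace, then normalise the spacing by
# splitting on whitespace and re-joining with single spaces.
alt = (
    df["col"]
    .str.replace(r"[^a-zA-Z\s]", "", regex=True)
    .str.split()
    .str.join(" ")
)

print(cleaned.tolist())  # ['Ill be there no hit', '', 'cafnave']
print(alt.tolist())      # ['Ill be there no hit', '', 'caf nave']

# Write the result back into the same column, as the question asks.
df["col"] = cleaned
```

Both variants give the expected `Ill be there no hit` for the question's example and an empty string when nothing survives. They differ on the last row: because the comment's pattern also consumes the space after a removed character, neighbouring words can merge (`cafnave`), while the split-and-join variant keeps them apart at the cost of two extra passes over the column. Neither variant has been timed here on 5 million rows.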
