I have a large DataFrame in Pandas. Its `col` column contains text (a sequence of words). For each value in this column, I would like all the words to be stripped of every character that is not an English letter. Here are some examples of things I want dropped:
- punctuation such as , . ' " { } [ ] ( ) ! @ # $ % & * - +
- all numbers/digits
- symbols like trademark, registered trademark, copyright, etc.
- all non-English characters
And most importantly, I want the results to be put back where they were. For example, if one field in `col` has the value `I'll be there @ 5, no $hit!`, the output should be `Ill be there no hit`, and it should be set as the new value for that row/column (making a copy of the DataFrame is okay). If dropping unwanted characters leads to an empty string, the value should be an empty string.
What is the most efficient way of doing this in Pandas? (The DataFrame has about 5 million rows, and each row has an average length of 50 for the `col` field.)
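
To make the requirement concrete, here is a minimal sketch of the behavior I am after, not necessarily the fastest way to get it. It assumes the column is named `col`, that "English characters" means the ASCII letters A-Z and a-z, and that runs of whitespace left behind by removed characters should collapse to a single space (as in the example above). The helper name `clean_text_column` is hypothetical.

```python
import pandas as pd


def clean_text_column(df: pd.DataFrame, column: str = "col") -> pd.DataFrame:
    """Return a copy of df whose `column` keeps only ASCII letters and single spaces."""
    out = df.copy()  # the question allows working on a copy
    out[column] = (
        out[column]
        # Drop everything that is neither an ASCII letter nor whitespace.
        .str.replace(r"[^A-Za-z\s]+", "", regex=True)
        # Collapse whitespace runs left behind by the removed characters.
        .str.replace(r"\s+", " ", regex=True)
        # Trim leading/trailing spaces; all-symbol values become "".
        .str.strip()
    )
    return out


df = pd.DataFrame(
    {"col": ["I'll be there @ 5, no $hit!", "123 ©™ !!!", "naïve café"]}
)
cleaned = clean_text_column(df)
print(cleaned["col"].tolist())
```

With this sample data, the first row becomes `Ill be there no hit` and the second becomes an empty string. Whether the vectorized `.str.replace` calls are the most efficient choice at 5 million rows is exactly what I am unsure about; a list comprehension with a precompiled `re` pattern is one alternative that would need to be timed against it.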