
I am trying to get a DataFrame from an existing DataFrame, containing only the rows where the values in a certain column (whose values are strings) do not contain a certain character.

For example, if the character we don't want is '(':

Original dataframe:

   some_col my_column
0         1      some
1         2      word
2         3    hello(

New dataframe:

   some_col my_column
0         1      some
1         2      word

I have tried df.loc['(' not in df['my_column']], but this does not work since df['my_column'] is a Series object.

I have also tried: df.loc[not df.my_column.str.contains('(')], which also does not work.
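For reference, the sample frame above can be reconstructed like this (column names taken from the printed output):

import pandas as pd

df = pd.DataFrame({'some_col': [1, 2, 3],
                   'my_column': ['some', 'word', 'hello(']})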

nmog

2 Answers


You're looking for str.isalpha:

df[df.my_column.str.isalpha()]

   some_col my_column
0         1      some
1         2      word

A similar method is str.isalnum, if you want to retain letters and digits.
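A quick sketch of the difference, using a made-up value 'word2' that contains a digit:

s = pd.Series(['some', 'word2', 'hello('])

s.str.isalpha()  # True, False, False -- the digit fails the check
s.str.isalnum()  # True, True, False  -- digits pass, '(' still does not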

If you want to retain word characters (letters, digits, underscores) as well as whitespace, use

df[~df.my_column.str.contains(r'[^\w\s]')]

   some_col my_column
0         1      some
1         2      word

Lastly, if you are looking to remove punctuation as a whole, I've written a Q&A here which might be a useful read: Fast punctuation removal with pandas

cs95

If you are looking to filter out just that character:

negation of str.contains

Escape the open paren: some characters are interpreted as special regex characters, and you can escape them with a backslash.

df[~df.my_column.str.contains(r'\(')]

   some_col my_column
0         1      some
1         2      word
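If you only need a literal substring test, another option is to pass regex=False to str.contains, which avoids escaping altogether and should give the same result here:

df[~df.my_column.str.contains('(', regex=False)]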

str.match all non-open-paren

By the way, this is a bad idea: using a regex to check that the whole string contains no occurrence of a single character is clumsy.

df[df.my_column.str.match(r'^[^\(]*$')]

   some_col my_column
0         1      some
1         2      word

Comprehension using in

df[['(' not in x for x in df.my_column]]

   some_col my_column
0         1      some
1         2      word
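Roughly the same idea can be written with Series.map, which produces a boolean Series instead of a plain list (a sketch, not part of the original answer):

df[df.my_column.map(lambda x: '(' not in x)]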
piRSquared