
I am trying to get a DataFrame from an existing DataFrame, containing only the rows where the values in a certain column (whose values are strings) do not contain a certain character.

For example, if the character we don't want is '(':

Original dataframe:

   some_col my_column
0         1      some
1         2      word
2         3    hello(

New dataframe:

   some_col my_column
0         1      some
1         2      word

I have tried df.loc['(' not in df['my_column']], but this does not work since df['my_column'] is a Series object.

I have also tried: df.loc[not df.my_column.str.contains('(')], which also does not work.
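For reference, the sample frame above can be reconstructed like this (column names taken from the printed output):

import pandas as pd

df = pd.DataFrame({'some_col': [1, 2, 3],
                   'my_column': ['some', 'word', 'hello(']})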

nmog

2 Answers


You're looking for str.isalpha:

df[df.my_column.str.isalpha()]

   some_col my_column
0         1      some
1         2      word

A similar method is str.isalnum, if you want to retain letters and digits.
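A quick sketch of the difference, using a made-up value 'word2' that contains a digit:

s = pd.Series(['some', 'word2', 'hello('])

s.str.isalpha()  # True, False, False -- the digit fails the check
s.str.isalnum()  # True, True, False  -- digits pass, '(' still does not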

If you want to retain word characters (letters, digits, underscores) as well as whitespace, use

df[~df.my_column.str.contains(r'[^\w\s]')]

   some_col my_column
0         1      some
1         2      word

Lastly, if you are looking to remove punctuation as a whole, I've written a Q&A here which might be a useful read: Fast punctuation removal with pandas

cs95

If you are looking to filter out just that character:

negation of str.contains

Escape the open paren: some characters are interpreted as special regex characters, and you can escape them with a backslash.

df[~df.my_column.str.contains(r'\(')]

   some_col my_column
0         1      some
1         2      word
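If you only need a literal substring test, another option is to pass regex=False to str.contains, which avoids escaping altogether and should give the same result here:

df[~df.my_column.str.contains('(', regex=False)]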

str.match all non-open-paren

By the way, this is a bad idea: using a regex to check that the whole string contains no occurrence of a single character is clumsy.

df[df.my_column.str.match(r'^[^\(]*$')]

   some_col my_column
0         1      some
1         2      word

Comprehension using in

df[['(' not in x for x in df.my_column]]

   some_col my_column
0         1      some
1         2      word
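Roughly the same idea can be written with Series.map, which produces a boolean Series instead of a plain list (a sketch, not part of the original answer):

df[df.my_column.map(lambda x: '(' not in x)]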
piRSquared