Fastest way to filter out pandas dataframe rows containing special characters

Question

I have a list of special characters, for example:

BAD_CHARS = ['.', '&', '\(', '\)', ';', '-']

I want to remove all rows of a pandas DataFrame whose `words` column contains any of these special characters. Currently I am doing the following:

df = '''
        words  frequency
            &         11
    CONDUCTED          3
       (E.G.,          5
   EXPERIMENT          6
         (VS.          5
        (WARD          3
            -         14
        2006;          3
           3D          5
         ABLE          5
     ABSTRACT          3
  ACCOMPANIED          5
     ACTIVITY         11
           AD          5
       ADULTS          6
'''
for char in BAD_CHARS:
    df = df[~df['words'].str.contains(char)]

# Expected Result
        words  frequency
    CONDUCTED          3
   EXPERIMENT          6
           3D          5
         ABLE          5
     ABSTRACT          3
  ACCOMPANIED          5
     ACTIVITY         11
           AD          5
       ADULTS          6

First, it does not work, and second, I suspect it is not fast. How can I do this in a faster way? Thanks.

First, don't escape the parentheses: `BAD_CHARS = ['.', '&', '(', ')', ';', '-']`. Next, you can either use a character class or use `re.escape`. Something like this: `df[~df['words'].str.contains("[{}]".format(''.join(BAD_CHARS)))]` – cs95, Jan 17 '18 at 13:20

jezrael · Accepted Answer · 2018-01-17T13:26:06.240

5

I believe you first need to escape the values with `re.escape` and then join them with `|`. Also, as @cᴏʟᴅsᴘᴇᴇᴅ pointed out, remove the `\` from the values in BAD_CHARS:

import re

BAD_CHARS = ['.', '&', '(', ')', ';', '-']
pat = '|'.join(['({})'.format(re.escape(c)) for c in BAD_CHARS])

df = df[~df['words'].str.contains(pat)]
print(df)
          words  frequency
1     CONDUCTED          3
3    EXPERIMENT          6
8            3D          5
9          ABLE          5
10     ABSTRACT          3
11  ACCOMPANIED          5
12     ACTIVITY         11
13           AD          5
14       ADULTS          6
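
As a hedged sketch of the character-class alternative mentioned in the comment on the question: the snippet below rebuilds the sample frame by hand and escapes each character with `re.escape` before placing it inside `[...]`. The DataFrame construction and the names `char_class`, `mask` and `clean` are assumptions for illustration, not part of the original post. Because this pattern has no capture groups, it should also avoid the match-group warning that pandas can emit for patterns written as `(...)|(...)`.

```python
import re

import pandas as pd

BAD_CHARS = ['.', '&', '(', ')', ';', '-']

# Sample data rebuilt by hand from the question (assumed construction).
df = pd.DataFrame({
    'words': ['&', 'CONDUCTED', '(E.G.,', 'EXPERIMENT', '(VS.', '(WARD', '-',
              '2006;', '3D', 'ABLE', 'ABSTRACT', 'ACCOMPANIED', 'ACTIVITY',
              'AD', 'ADULTS'],
    'frequency': [11, 3, 5, 6, 5, 3, 14, 3, 5, 5, 3, 5, 11, 5, 6],
})

# re.escape makes every character literal, so '-' cannot form a range
# and '.' cannot act as a wildcard.
char_class = '[{}]'.format(''.join(re.escape(c) for c in BAD_CHARS))

# One vectorised pass over the column instead of one pass per character;
# na=False treats missing values as "no match".
mask = df['words'].str.contains(char_class, regex=True, na=False)
clean = df[~mask]
print(clean)
```

This replaces the per-character loop from the question with a single `str.contains` call. Whether it is the fastest option for a given frame is something to measure rather than assume.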

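The escaping is needed because joining the raw values returns an empty frame. The unescaped `.` is treated as a regex wildcard, so every row matches and is then dropped:

df[~df['words'].str.contains('|'.join(BAD_CHARS))]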
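
To see why the unescaped pattern drops every row, here is a minimal sketch using only the standard `re` module. `str.contains` interprets its pattern as a regular expression by default, so the same matching rules apply. The `words` list and the variable names are illustrative assumptions, not taken from the post.

```python
import re

BAD_CHARS = ['.', '&', '(', ')', ';', '-']
words = ['CONDUCTED', '3D', '(E.G.,', '&']

raw_pat = '|'.join(BAD_CHARS)                         # '.|&|(|)|;|-'
safe_pat = '|'.join(re.escape(c) for c in BAD_CHARS)  # every character literal

# Unescaped: '.' is a wildcard, so every non-empty string matches.
raw_hits = [bool(re.search(raw_pat, w)) for w in words]

# Escaped: only strings containing one of the literal characters match.
safe_hits = [bool(re.search(safe_pat, w)) for w in words]

print(raw_hits)   # [True, True, True, True]
print(safe_hits)  # [False, False, True, True]
```

With the raw pattern every word counts as "bad", so negating the mask leaves nothing.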

edited Jan 17 '18 at 13:26

answered Jan 17 '18 at 13:15

It returns empty frame :( – muazfaiz Jan 17 '18 at 13:20
The question was closed as a dupe, and I've addressed the specifics of their question in a comment. Or else, I could have posted the answer myself :/ – cs95 Jan 17 '18 at 13:28
Thanks. How easy it was :) – muazfaiz Jan 17 '18 at 13:30
@cᴏʟᴅsᴘᴇᴇᴅ - I don't understand `Or else, I could have posted the answer myself :/`. Do you think I copied the answer from your comment? I used only one part of the comment (don't escape the values), and I mentioned that in the answer. – jezrael Jan 17 '18 at 13:41
