I have a list of special characters. For example:

BAD_CHARS = ['.', '&', '\(', '\)', ';', '-']

I want to remove every row of a pandas DataFrame whose column value contains any of these special characters. Currently, I am doing the following:

df = '''
        words  frequency
            &         11
    CONDUCTED          3
       (E.G.,          5
   EXPERIMENT          6
         (VS.          5
        (WARD          3
            -         14
        2006;          3
           3D          5
         ABLE          5
     ABSTRACT          3
  ACCOMPANIED          5
     ACTIVITY         11
           AD          5
       ADULTS          6
'''
for char in BAD_CHARS:
    df = df[~df['words'].str.contains(char)]

# Expected Result
        words  frequency
    CONDUCTED          3
   EXPERIMENT          6
           3D          5
         ABLE          5
     ABSTRACT          3
  ACCOMPANIED          5
     ACTIVITY         11
           AD          5
       ADULTS          6

First, it is not working, and second, I suspect it is not fast. How can I do this in a faster way? Thanks.

muazfaiz
  • @Zero mark it, please. – cs95 Jan 17 '18 at 13:17
  • First, don't escape the parentheses: `BAD_CHARS = ['.', '&', '(', ')', ';', '-']`. Next, you can either use a character class or use `re.escape`. Something like this: `df[~df['words'].str.contains("[{}]".format(''.join(BAD_CHARS)))]` – cs95 Jan 17 '18 at 13:20
  • If you have issues copying that, just type it out. – cs95 Jan 17 '18 at 13:24
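
The character-class approach from the comment above can be sketched as follows. This is a minimal illustration, not the commenter's exact code: the small `df` is a hypothetical sample mirroring a few rows of the question's data, and each character is additionally passed through `re.escape` (an assumption on my part) so that `-` stays literal even if the list is reordered.

```python
import re

import pandas as pd

BAD_CHARS = ['.', '&', '(', ')', ';', '-']

# Hypothetical sample mirroring a few rows of the question's data.
df = pd.DataFrame({
    'words': ['&', 'CONDUCTED', '(E.G.,', 'EXPERIMENT', '-', '2006;', '3D'],
    'frequency': [11, 3, 5, 6, 14, 3, 5],
})

# Build one character class such as [\.\&\(\);\-]. Escaping each character
# keeps '-' literal instead of letting it act as a range operator.
char_class = '[{}]'.format(''.join(re.escape(c) for c in BAD_CHARS))

# A single vectorized pass: keep rows whose word matches none of the characters.
clean = df[~df['words'].str.contains(char_class)]
print(clean)
```

Because the filter runs once over the column instead of once per character, it also avoids rebuilding the frame on every loop iteration.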

1 Answer

I believe you first need to escape the values and then join them with |. Also, as @cᴏʟᴅsᴘᴇᴇᴅ pointed out, remove the \ from the values in BAD_CHARS:

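import re

# Leave the parentheses unescaped here; re.escape adds the backslashes
BAD_CHARS = ['.', '&', '(', ')', ';', '-']
# Escape each character, then join everything into one alternation pattern
# (no capture groups, so pandas does not warn about match groups)
pat = '|'.join(re.escape(c) for c in BAD_CHARS)

# Keep only the rows whose word contains none of the characters
df = df[~df['words'].str.contains(pat)]
print(df)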
          words  frequency
1     CONDUCTED          3
3    EXPERIMENT          6
8            3D          5
9          ABLE          5
10     ABSTRACT          3
11  ACCOMPANIED          5
12     ACTIVITY         11
13           AD          5
14       ADULTS          6

Escaping is necessary because an unescaped . is a regex wildcard that matches any character, so this returns an empty frame:

df[~df['words'].str.contains('|'.join(BAD_CHARS))]
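
For anyone who wants to reproduce both results, here is a self-contained sketch. It assumes the question's table is loaded into a DataFrame with a default integer index; the variable names `kept` and `everything_dropped` are only illustrative.

```python
import re

import pandas as pd

BAD_CHARS = ['.', '&', '(', ')', ';', '-']

# The question's sample data, rebuilt as a real DataFrame.
df = pd.DataFrame({
    'words': ['&', 'CONDUCTED', '(E.G.,', 'EXPERIMENT', '(VS.', '(WARD', '-',
              '2006;', '3D', 'ABLE', 'ABSTRACT', 'ACCOMPANIED', 'ACTIVITY',
              'AD', 'ADULTS'],
    'frequency': [11, 3, 5, 6, 5, 3, 14, 3, 5, 5, 3, 5, 11, 5, 6],
})

# Escaped alternation: every character is matched literally.
escaped_pat = '|'.join(re.escape(c) for c in BAD_CHARS)
kept = df[~df['words'].str.contains(escaped_pat)]
print(kept)

# A bare '.' is a wildcard: it matches every non-empty word,
# so negating the mask drops every row.
everything_dropped = df[~df['words'].str.contains('.')]
print(everything_dropped.empty)
```

The row labels left in `kept` should line up with the printed result above, since filtering preserves the original index.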
jezrael
  • It returns empty frame :( – muazfaiz Jan 17 '18 at 13:20
  • The question was closed as a dupe, and I've addressed the specifics of their question in a comment. Or else, I could have posted the answer myself :/ – cs95 Jan 17 '18 at 13:28
  • Thanks. How easy it was :) – muazfaiz Jan 17 '18 at 13:30
  • @cᴏʟᴅsᴘᴇᴇᴅ - I don't understand `Or else, I could have posted the answer myself :/`. Do you think I copied your comment? I used only one part of it (don't escape the values), and I mentioned that in the answer. – jezrael Jan 17 '18 at 13:41