
I have a table with 87 million rows and 5 columns, and a separate .txt file with around 3500 words. For each row of the table, I want to check whether any of those words appears in 4 of the columns; if a word is present in any of those columns, I want to remove that row. This would reduce the number of rows considerably. The following is the code I am using:

bad_words = pd.read_csv('badwords.txt')

bad_words.dtypes

words    object
dtype: object

bad_words

        words
    0   word1
    1   word3
    2   word5
    3   word13
    4   word16
data

s.no      column1        column2        column3         column4
1         aaaword1b      aaaword2b      aaaword3b       aaaword4b
2         aaaword5b      aaaword6b      aaaword7b       aaaword8b                         
3         aaaword9b      aaaword10b     aaaword11b      aaaword12b
4         aaaword13b     aaaword14b     aaaword15b      aaaword16b
5         aaaword17b     aaaword18b     aaaword19b      aaaword20b

I want to remove the rows that contain any word from the bad-words file. The output should be:

data

s.no      column1        column2        column3         column4                       
3         aaaword9b       aaaword10b    aaaword11b      aaaword12b
5         aaaword17b      aaaword18b    aaaword19b      aaaword20b

I am trying to do something like this:

data[(data['column1'].str.contains("word1|word3|word5|word13|word16")==False)| 
(data['column2'].str.contains("word1|word3|word5|word13|word16")==False)|
(data['column3'].str.contains("word1|word3|word5|word13|word16")==False)]

But I am not sure whether this can be done for all 3500 words, and I am also not sure whether it is an efficient approach for 87 million rows.
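
For example, I am wondering whether I could build the pattern from the file instead of typing all 3500 words, something like this (untested sketch; I am assuming the words should be escaped in case any of them contains a regex metacharacter):

import re

# Build one alternation pattern from all the words in the file;
# re.escape makes any regex metacharacters match literally.
pattern = '|'.join(re.escape(w) for w in bad_words['words'])

# Drop every row where any of the four columns matches the pattern.
mask = (data['column1'].str.contains(pattern) |
        data['column2'].str.contains(pattern) |
        data['column3'].str.contains(pattern) |
        data['column4'].str.contains(pattern))
data = data[~mask]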

Update: I have changed the examples to use string patterns rather than the exact words. Sorry for the unclear requirement earlier.

Can anybody suggest a better way to do this?

Thanks

haimen

2 Answers


You can use the apply method to check row by row, using isin to build a Boolean vector that indicates whether the row contains anything in bad_words, and then subset the original data frame with the negation of that vector:

data[~data.apply(lambda row: row.isin(bad_words.words).any(), axis = 1)]

#s.no   column1 column2 column3 column4
#2  3     word9  word10  word11  word12
#4  5    word17  word18  word19  word20

For the updated question, here is an option that might work depending on your actual data:

data[~data.apply(lambda row: bad_words.words.apply(lambda w: row.str.contains(w + r"(?=\D)").any()).any(), axis = 1)]


# sno      column1     column2     column3     column4
#2  3    aaaword9b  aaaword10b  aaaword11b  aaaword12b
#4  5   aaaword17b  aaaword18b  aaaword19b  aaaword20b
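
A caveat: if any of the 3500 words contains a regex metacharacter such as * or +, str.contains will raise an error like "nothing to repeat", because each word is used as a raw regex. A possible workaround (a sketch on my part, assuming that is the cause) is to escape each word first:

import re

# re.escape makes metacharacters in a word match literally; the (?=\D)
# lookahead still requires the character after the word to be a non-digit.
data[~data.apply(lambda row: bad_words.words.apply(
    lambda w: row.str.contains(re.escape(w) + r"(?=\D)").any()).any(), axis = 1)]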
Psidom
    Note that `isin` works on DataFrames, so you don't need to use `apply`: `data[~data.isin(bad_words.words).any(axis=1)]` will suffice. – root Aug 18 '16 at 17:21
  • @Psidom thanks for the code. I left one requirement out of the question: the word in the columns won't be the whole value, it will be inside a huge string. I am updating the question now. Can you help me with this? Every row has huge strings, and I need to check for the word patterns within them and remove the matching rows. – haimen Aug 18 '16 at 17:26
  • @Psidom Can you explain what you wrote in str.contains? I am getting an error with my data. – haimen Aug 18 '16 at 17:54
  • @Psidom I am getting this error: ('nothing to repeat', u'occurred at index 0'). – haimen Aug 18 '16 at 17:58
  • @Psidom When I use a literal word instead of w + "(?=\D)", I get output. But when I use w there, I get "nothing to repeat". Can you help me find the problem? – haimen Aug 18 '16 at 18:24

I altered your example because word1 is technically a substring of word11 and word12, and I don't think that's what you meant.
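
A quick illustration of the ambiguity with plain substring checks (using the underscore-delimited words from the setup below):

# Without delimiters, word1 also matches inside word11 -- a false positive:
'word1' in 'aaaword11b'        # True

# With underscore delimiters the match is unambiguous:
'_word1_' in 'aaa_word11_b'    # False
'_word1_' in 'aaa_word1_b'     # True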

Setup

from io import StringIO
import pandas as pd

text_bad_words = """        words
    0   _word1_
    1   _word3_
    2   _word5_
    3   _word13_
    4   _word16_"""

text_data = """s.no      column1        column2        column3         column4
1         aaa_word1_b      aaa_word2_b      aaa_word3_b       aaa_word4_b
2         aaa_word5_b      aaa_word6_b      aaa_word7_b       aaa_word8_b
3         aaa_word9_b      aaa_word10_b     aaa_word11_b      aaa_word12_b
4         aaa_word13_b     aaa_word14_b     aaa_word15_b      aaa_word16_b
5         aaa_word17_b     aaa_word18_b     aaa_word19_b      aaa_word20_b"""

bad_words = pd.read_csv(
    StringIO(text_bad_words), squeeze=True, index_col=0, delim_whitespace=True)
data = pd.read_csv(
    StringIO(text_data), squeeze=True, index_col=0, delim_whitespace=True)

Solution

I'll use a regex alternation with str.contains.

regex = r'|'.join(bad_words)
regex

'_word1_|_word3_|_word5_|_word13_|_word16_'

Create a boolean mask: stack flattens the frame into a single Series so one str.contains call covers all four columns, unstack restores the original shape, and any(1) flags the rows where any column matched.

mask = data.stack().str.contains(regex).unstack().any(1)
mask

s.no
1     True
2     True
3    False
4     True
5    False
dtype: bool

data.loc[~mask]

          column1       column2       column3       column4
s.no
3     aaa_word9_b  aaa_word10_b  aaa_word11_b  aaa_word12_b
5    aaa_word17_b  aaa_word18_b  aaa_word19_b  aaa_word20_b

piRSquared
  • thanks for the code. I left one requirement out of the question: the word in the columns won't be the whole value, it will be inside a huge string. I am updating the question now. Can you help me with this? Every row has huge strings, and I need to check for the word patterns within them and remove the matching rows. – haimen Aug 18 '16 at 17:31
  • when I use data.loc[~data.stack().str.contains(regex).unstack().any(1)] I get the error "nothing to repeat", but something like data.loc[~data.stack().str.contains('word11|word1').unstack().any(1)] works fine. Can you help me understand why this is happening? – haimen Aug 18 '16 at 18:22
  • I think it's a bug in your version of python. http://stackoverflow.com/questions/3675144/regex-error-nothing-to-repeat – piRSquared Aug 18 '16 at 18:31