274

I've done some searching and can't figure out how to filter a dataframe by

df["col"].str.contains(word)

however, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. something to the effect of

!(df["col"].str.contains(word))

Can this be done through a DataFrame method?

stites

10 Answers

522

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

where new_df is the copy returned by the right-hand side.

contains also accepts a regular expression (the pattern is treated as a regex by default).
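For example, a minimal sketch (the column name and patterns here are illustrative) that drops rows matching either of two regex alternatives:

import pandas as pd

df = pd.DataFrame({"col": ["apple pie", "banana split", "cherry tart"]})

# ~ inverts the boolean mask, keeping rows whose "col" does NOT match the regex
new_df = df[~df["col"].str.contains("apple|cherry")]
print(new_df)  # keeps only "banana split"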


If the above throws a ValueError or TypeError, it is likely because you have mixed datatypes; use na=False to exclude missing values:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
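Note that the two variants treat missing values differently: na=False maps NaN to False, so the inverted mask keeps NaN rows, while the == False comparison drops them (NaN == False evaluates to False). A small sketch of the difference, with made-up data:

import pandas as pd

df = pd.DataFrame({"col": ["foo", None, "bar"]})

# na=False maps NaN -> False, so ~ keeps the NaN row
print(df[~df["col"].str.contains("foo", na=False)])  # keeps None and "bar"

# NaN == False is False, so the comparison also drops the NaN row
print(df[df["col"].str.contains("foo") == False])    # keeps only "bar"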
fantabolous
Andy Hayden
  • Perfect! I'm SQL-familiar with regex and thought it was different in Python - saw a lot of articles with `re.compile` and told myself I'd get to that later. Looks like I overfit the search and it's just as you say : ) – stites Jun 14 '13 at 14:58
  • Maybe a full example would be helpful: `df[~df.col.str.contains(word)]` returns a copy of the original dataframe with excluded rows matching the word. – Dennis Golomazov Jun 12 '17 at 18:03
91

I was having trouble with the not (~) symbol as well, so here's another way, from a different StackOverflow thread:

df[df["col"].str.contains('this|that')==False]
Shaido
nanselm2
  • Can it be combined like this? `df[df["col1"].str.contains('this'|'that')==False and df["col2"].str.contains('foo'|'bar')==True]`? Thanks! – tommy.carstensen May 24 '17 at 10:54
  • Yes, you can. The syntax is explained here: https://stackoverflow.com/questions/22086116/how-do-you-filter-pandas-dataframes-by-multiple-columns – tommy.carstensen May 24 '17 at 11:26
  • Not to forget that if we want to remove rows which contain "|", we should escape it with "\", as in `df = df[~df["col"].str.contains('\|')]` – Amir Nov 26 '19 at 08:59
21

You can use apply and a lambda:

df[df["col"].apply(lambda x: word not in x)]

Or, if you want to define a more complex rule, you can use and:

df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Arash
  • I think 'in' checks for equality, so 'not in' will check "not equals" rather than "does not contain", right? – Raman Joshi Aug 17 '21 at 05:05
  • @RamanJoshi please read the question: search for “does-not-contain” – Arash Aug 18 '21 at 06:44
  • I have read the question carefully; that's why I'm pointing out the big difference between "not equals" and "not contains". E.g., if we have a list containing the items ["hello", "world", "test"] and we check "not equals" for the text "ello", it returns true, since "ello" is not equal to any of the items. But when we check "not contains", it should return false, since one item, "hello", contains the text "ello". I think you are getting the question wrong. – Raman Joshi Aug 19 '21 at 07:16
  • Hmm, my code is checking the "does not contain" condition, not the "equals" condition. What's the problem? `word_1 not in x` means "x doesn't contain word_1", in other words "word_1 is not found in x". It has nothing to do with being "equal"! – Arash Aug 21 '21 at 06:10
20

The answers above already cover the basics; I am adding a framework to search for multiple patterns and exclude the matching rows from the DataFrame.

# 'word1', 'word2', 'word3', 'word4': the list of patterns to search for
# df: the DataFrame
# column_a: a column name from DataFrame df

values_to_remove = ['word1', 'word2', 'word3', 'word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
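If any of the values contain regex metacharacters (such as | or .), escape them first so they match literally; a sketch using re.escape (the sample data is made up):

import re
import pandas as pd

df = pd.DataFrame({"column_a": ["a|b", "plain", "WORD1 here"]})

values_to_remove = ["word1", "a|b"]
# re.escape protects metacharacters like | so they are matched literally
pattern = "|".join(re.escape(v) for v in values_to_remove)
result = df.loc[~df["column_a"].str.contains(pattern, case=False)]
print(result)  # keeps only "plain"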
Noordeen
  • 1,547
  • 20
  • 26
8

I had to get rid of the NULL values before using the command recommended by Andy above. An example:

import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'   # .ix has been removed from pandas; .loc does the same here
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df

    first   second  third
0   myword  myword  NaN
1   myword  NaN     myword
2   myword  myword  NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
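For example, reusing the df built above, filling the NaNs with an empty string yields a clean boolean mask:

# fillna("") turns NaN into "", so str.contains returns plain booleans
mask = ~df["second"].fillna("").str.contains("myword")
df[mask]  # keeps only row 1, where "second" was NaN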

Shoresh
  • You can also use `~df["second"].astype(str).str.contains(word)` to force conversion to `str`. See https://stackoverflow.com/questions/43568760/pandas-drop-rows-where-column-contains – David C Jan 31 '18 at 21:07
  • @Shoresh we can also use `na=False` as a solution to this problem – Vishav Gupta Feb 13 '20 at 10:29
5

To negate your query, use ~. Using query has the advantage of returning the valid observations of df directly:

df.query('~col.str.contains("word").values')
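Depending on your pandas version, the default numexpr engine may reject .str methods inside query; passing engine='python' is a safe variant. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"col": ["word here", "clean"]})

# engine='python' evaluates the expression with ordinary Python semantics
print(df.query('~col.str.contains("word")', engine='python'))  # keeps "clean"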
rachwa
  • This is perfect when you want to chain operations: (df .filter(items=['time', ' lat', ' long', ' col']) .rename(columns={' lat': 'lat', ' long': 'lng', ' col': 'col'}) .query('lat != 0.0') .query('lng != 0.0') .query('~col.str.contains("word").values')) – oekici Oct 26 '22 at 20:09
3

In addition to nanselm2's answer, you can use 0 instead of False:

df["col"].str.contains(word)==0
U13-Forward
1

Somehow .contains didn't work for me, but .isin, as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), did. Additionally, if you want to scan the entire dataframe and remove rows containing a specific word (or set of words), just use the loop below:

for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]

Just remove the ~ to get the dataframe rows that do contain the word.
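Keep in mind that isin tests exact cell equality, not substring containment, so it behaves differently from str.contains. A small sketch of the contrast, with made-up data:

import pandas as pd

df = pd.DataFrame({"col": ["hello", "hello world"]})

print(df[~df["col"].isin(["hello"])])        # drops only the exact "hello" row
print(df[~df["col"].str.contains("hello")])  # drops both rows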

Bhanu Chander
0

To complement the above question, if someone wants to remove all the rows with strings, one could do:

df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
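For instance, on a mixed-type column (a minimal sketch with made-up data):

import pandas as pd

df = pd.DataFrame({"col_name": [1, "text", 2.5]})

# isinstance flags the string cells; ~ keeps everything else
df_new = df[~df["col_name"].apply(lambda x: isinstance(x, str))]
print(df_new)  # keeps the rows holding 1 and 2.5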
vasanth
0

To add clarity to the top answer, the general pattern for filtering out all columns whose name contains a specific word is:

# Remove any column with "word" in the name
new_df = df.loc[:, ~df.columns.str.contains("word")]

# Remove columns matching any of several words
new_df = df.loc[:, ~df.columns.str.contains("word1|word2")]
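A quick usage sketch (the column names are made up):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["keep_me", "word_a", "word_b"])

new_df = df.loc[:, ~df.columns.str.contains("word")]
print(new_df.columns.tolist())  # ['keep_me']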