274

I've done some searching and can't figure out how to filter a dataframe by

df["col"].str.contains(word)

however, I'm wondering if there is a way to do the reverse: filter a dataframe by that set's complement, e.g. something to the effect of

!(df["col"].str.contains(word))

Can this be done through a DataFrame method?

stites

10 Answers

522

You can use the invert (~) operator (which acts like a not for boolean data):

new_df = df[~df["col"].str.contains(word)]

where new_df is the copy returned by the right-hand side.

contains also accepts a regular expression (the pattern is treated as a regex by default).
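For example, a minimal sketch (the column name and patterns here are illustrative) that drops rows matching either of two regex alternatives:

import pandas as pd

df = pd.DataFrame({"col": ["apple pie", "banana split", "cherry tart"]})

# ~ inverts the boolean mask, keeping rows whose "col" does NOT match the regex
new_df = df[~df["col"].str.contains("apple|cherry")]
print(new_df)  # keeps only "banana split"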


If the above throws a ValueError or TypeError, it is likely because you have mixed datatypes; use na=False to exclude missing values:

new_df = df[~df["col"].str.contains(word, na=False)]

Or,

new_df = df[df["col"].str.contains(word) == False]
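Note that the two variants treat missing values differently: na=False maps NaN to False, so the inverted mask keeps NaN rows, while the == False comparison drops them (NaN == False evaluates to False). A small sketch of the difference, with made-up data:

import pandas as pd

df = pd.DataFrame({"col": ["foo", None, "bar"]})

# na=False maps NaN -> False, so ~ keeps the NaN row
print(df[~df["col"].str.contains("foo", na=False)])  # keeps None and "bar"

# NaN == False is False, so the comparison also drops the NaN row
print(df[df["col"].str.contains("foo") == False])    # keeps only "bar"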
fantabolous
Andy Hayden
  • Perfect! I'm SQL-familiar with regex and thought it was different in Python - saw a lot of articles with `re.compile` and told myself I'd get to that later. Looks like I overfit the search and it's just as you say : ) – stites Jun 14 '13 at 14:58
  • Maybe a full example would be helpful: `df[~df.col.str.contains(word)]` returns a copy of the original dataframe with excluded rows matching the word. – Dennis Golomazov Jun 12 '17 at 18:03
91

I was having trouble with the not (~) symbol as well, so here's another way, from a different StackOverflow thread:

df[df["col"].str.contains('this|that')==False]
Shaido
nanselm2
  • Can it be combined like this? `df[df["col1"].str.contains('this'|'that')==False and df["col2"].str.contains('foo'|'bar')==True]`? Thanks! – tommy.carstensen May 24 '17 at 10:54
  • Yes, you can. The syntax is explained here: https://stackoverflow.com/questions/22086116/how-do-you-filter-pandas-dataframes-by-multiple-columns – tommy.carstensen May 24 '17 at 11:26
  • Not to forget that if we want to remove rows which contain "|", we should escape it with "\", as in `df = df[~df["col"].str.contains('\|')]` – Amir Nov 26 '19 at 08:59
21

You can use apply and a lambda:

df[df["col"].apply(lambda x: word not in x)]

Or, if you want to define a more complex rule, you can use and:

df[df["col"].apply(lambda x: word_1 not in x and word_2 not in x)]
Arash
  • I think 'in' checks for equality, so 'not in' will check "not equals" rather than "does not contain", right? – Raman Joshi Aug 17 '21 at 05:05
  • @RamanJoshi please read the question: search for “does-not-contain” – Arash Aug 18 '21 at 06:44
  • I have read the question carefully; that's why I'm pointing out the big difference between "not equals" and "not contains". E.g., if we have a list containing the items ["hello", "world", "test"] and we check "not equals" for the text "ello", it returns true, since "ello" is not equal to any of the items. But when we check "not contains", it should return false, since one item, "hello", contains the text "ello". I think you are getting the question wrong. – Raman Joshi Aug 19 '21 at 07:16
  • Hmm, my code is checking the "does not contain" condition, not the "equals" condition. What's the problem? `word_1 not in x` means "x doesn't contain word_1", in other words "word_1 is not found in x". It has nothing to do with being "equal"! – Arash Aug 21 '21 at 06:10
20

The answers above already cover the basics; I am adding a framework to search for multiple patterns and exclude the matching rows from the DataFrame.

# 'word1', 'word2', 'word3', 'word4': the list of patterns to search for
# df: the DataFrame
# column_a: a column name from DataFrame df

values_to_remove = ['word1', 'word2', 'word3', 'word4']
pattern = '|'.join(values_to_remove)
result = df.loc[~df['column_a'].str.contains(pattern, case=False)]
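If any of the values contain regex metacharacters (such as | or .), escape them first so they match literally; a sketch using re.escape (the sample data is made up):

import re
import pandas as pd

df = pd.DataFrame({"column_a": ["a|b", "plain", "WORD1 here"]})

values_to_remove = ["word1", "a|b"]
# re.escape protects metacharacters like | so they are matched literally
pattern = "|".join(re.escape(v) for v in values_to_remove)
result = df.loc[~df["column_a"].str.contains(pattern, case=False)]
print(result)  # keeps only "plain"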
Noordeen
  • 1,547
  • 20
  • 26
8

I had to get rid of the NULL values before using the command recommended by Andy above. An example:

import pandas as pd

df = pd.DataFrame(index=[0, 1, 2], columns=['first', 'second', 'third'])
df.loc[:, 'first'] = 'myword'   # .ix has been removed from pandas; .loc does the same here
df.loc[0, 'second'] = 'myword'
df.loc[2, 'second'] = 'myword'
df.loc[1, 'third'] = 'myword'
df

    first   second  third
0   myword  myword  NaN
1   myword  NaN     myword
2   myword  myword  NaN

Now running the command:

~df["second"].str.contains(word)

I get the following error:

TypeError: bad operand type for unary ~: 'float'

I got rid of the NULL values using dropna() or fillna() first and retried the command with no problem.
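For example, reusing the df built above, filling the NaNs with an empty string yields a clean boolean mask:

# fillna("") turns NaN into "", so str.contains returns plain booleans
mask = ~df["second"].fillna("").str.contains("myword")
df[mask]  # keeps only row 1, where "second" was NaN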

Shoresh
  • You can also use `~df["second"].astype(str).str.contains(word)` to force conversion to `str`. See https://stackoverflow.com/questions/43568760/pandas-drop-rows-where-column-contains – David C Jan 31 '18 at 21:07
  • @Shoresh we can also use `na=False` as a solution to this problem – Vishav Gupta Feb 13 '20 at 10:29
5

To negate your query, use ~. Using query has the advantage of returning the valid observations of df directly:

df.query('~col.str.contains("word").values')
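Depending on your pandas version, the default numexpr engine may reject .str methods inside query; passing engine='python' is a safe variant. A minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({"col": ["word here", "clean"]})

# engine='python' evaluates the expression with ordinary Python semantics
print(df.query('~col.str.contains("word")', engine='python'))  # keeps "clean"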
rachwa
  • This is perfect when you want to chain operations: (df .filter(items=['time', ' lat', ' long', ' col']) .rename(columns={' lat': 'lat', ' long': 'lng', ' col': 'col'}) .query('lat != 0.0') .query('lng != 0.0') .query('~col.str.contains("word").values')) – oekici Oct 26 '22 at 20:09
3

In addition to nanselm2's answer, you can use 0 instead of False:

df["col"].str.contains(word)==0
U13-Forward
1

Somehow .contains didn't work for me, but .isin, as mentioned by @kenan in this answer (How to drop rows from pandas data frame that contains a particular string in a particular column?), did. Additionally, if you want to scan the entire dataframe and remove rows containing a specific word (or set of words), just use the loop below:

for col in df.columns:
    df = df[~df[col].isin(['string or string list separated by comma'])]

Just remove the ~ to get the dataframe rows that do contain the word.
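Keep in mind that isin tests exact cell equality, not substring containment, so it behaves differently from str.contains. A small sketch of the contrast, with made-up data:

import pandas as pd

df = pd.DataFrame({"col": ["hello", "hello world"]})

print(df[~df["col"].isin(["hello"])])        # drops only the exact "hello" row
print(df[~df["col"].str.contains("hello")])  # drops both rows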

Bhanu Chander
0

To complement the above question, if someone wants to remove all the rows with strings, one could do:

df_new=df[~df['col_name'].apply(lambda x: isinstance(x, str))]
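For instance, on a mixed-type column (a minimal sketch with made-up data):

import pandas as pd

df = pd.DataFrame({"col_name": [1, "text", 2.5]})

# isinstance flags the string cells; ~ keeps everything else
df_new = df[~df["col_name"].apply(lambda x: isinstance(x, str))]
print(df_new)  # keeps the rows holding 1 and 2.5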
vasanth
0

To add clarity to the top answer, the general pattern for filtering out all columns whose name contains a specific word is:

# Remove any column with "word" in the name
new_df = df.loc[:, ~df.columns.str.contains("word")]

# Remove columns matching any of several words
new_df = df.loc[:, ~df.columns.str.contains("word1|word2")]
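A quick usage sketch (the column names are made up):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["keep_me", "word_a", "word_b"])

new_df = df.loc[:, ~df.columns.str.contains("word")]
print(new_df.columns.tolist())  # ['keep_me']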