111

I have a pandas dataframe, df.

I want to select all indices in df that are not in a list, blacklist.

Now, I use list comprehension to create the desired labels to slice.

ix=[i for i in df.index if i not in blacklist]  
df_select=df.loc[ix]

Works fine, but may be clumsy if I need to do this often.

Is there a better way to do this?

cs95
  • 379,657
  • 97
  • 704
  • 746
lmart999
  • 6,671
  • 10
  • 29
  • 37
  • Possible duplicate of [dropping rows from dataframe based on a "not in" condition](https://stackoverflow.com/questions/27965295/dropping-rows-from-dataframe-based-on-a-not-in-condition) – Jim G. Sep 11 '19 at 18:34

9 Answers9

169

Use isin on the index and invert the boolean index to perform label selection:

In [239]:

df = pd.DataFrame({'a':np.random.randn(5)})
df
Out[239]:
          a
0 -0.548275
1 -0.411741
2 -1.187369
3  1.028967
4 -2.755030
In [240]:

t = [2,4]
df.loc[~df.index.isin(t)]
Out[240]:
          a
0 -0.548275
1 -0.411741
3  1.028967
Hooked
  • 84,485
  • 43
  • 192
  • 261
EdChum
  • 376,765
  • 198
  • 813
  • 562
24

You could use set() to create the difference between your original indices and those that you want to remove:

df.loc[set(df.index) - set(blacklist)]

It has the advantage of being parsimonious, as well as being easier to read than a list comprehension.

ASGM
  • 11,051
  • 1
  • 32
  • 53
4

Thanks to ASGM; I found that I needed to turn the set into a list to make it work with a MultiIndex:

mi1 = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1), ("b", 2)])
df1 = pd.DataFrame(data={"aaa":[1,2,3,4]}, index=mi1)
setValid = set(df1.index) - set([("a", 2)])
df1.loc[list(setValid)] # works
df1.loc[setValid] # fails

(sorry can't comment, insufficient rep)

Hagrid67
  • 344
  • 2
  • 9
4

If you are looking for a way to select all rows that are outside a condition you can use np.invert() given that the condition returns an array of booleans.

df.loc[np.invert(({condition 1}) & (condition 2))]
zmag
  • 7,825
  • 12
  • 32
  • 42
4
df = pd.DataFrame(data=[5,6,7,8], index=[1,2,3,4], columns=['D',])
blacklist = [2,3]

df.drop(blacklist,0)
4b0
  • 21,981
  • 30
  • 95
  • 142
  • 7
    Code-only answers are not particularly helpful. Please include a brief description of how this code solves the problem. – 4b0 Oct 22 '20 at 08:52
  • Please don't post only code as answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes. – ZF007 Oct 22 '20 at 10:38
  • This, in my opinion, is a lot prettier/more elegant than doing `df.loc[~df.index.isin(blacklist)]`. However, it is less interpretable, since usually people only use `drop` to remove columns (so `axis=1`). – Corey Levinson Feb 05 '23 at 17:37
4

You could use difference() to obtain the difference between your original indices and those that you want to exclude:

df.loc[df.index.difference(blacklist), :]

It has the advantage of being easier to read.

1
import pandas as pd
df = pd.DataFrame(data=[5,6,7,8], index=[1,2,3,4], columns=['D',])
blacklist = [2,3]
#your current way ...
ix=[i for i in df.index if i not in blacklist]  
df_select=df.loc[ix]

# use a mask
mask = [True if x else False for x in df.index if x not in blacklist]
df.loc[mask]

http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-label actually, loc and iloc both take a boolean array, in this case the mask. from now on you can reuse this mask and should be more efficient.

Dyno Fu
  • 8,753
  • 4
  • 39
  • 64
1

Inside query you can access your variable blacklist using @:

df.query('index != @blacklist')

# Or alternatively:
df.query('index not in @blacklist')
rachwa
  • 1,805
  • 1
  • 14
  • 17
0

You can use the np.setdiff1d function which finds the set difference of two arrays.

index = np.array(blacklist)
not_index = np.setdiff1d(df.index.to_numpy(), index)
df.iloc[not_index]
Dharman
  • 30,962
  • 25
  • 85
  • 135
hamnghi
  • 99
  • 1
  • 7