
I have several datasets where I need to discard observations if all the values in a subset of columns are equal to 0 or N (depending on the column). For example:

df = pd.DataFrame({'one':[0,0,1,2,0], 'two':[0,0,0,0,0], 'three':['N','Y','N','Y','N']})
drop_subset = ['one', 'three']

In [4]: df
Out[4]:
   one three  two
0    0     N    0
1    0     Y    0
2    1     N    0
3    2     Y    0
4    0     N    0

I need to look at just columns one and three, then drop rows 0 and 4 only. If it were just a single value rather than two or more, I would use this:

df[~(df[drop_subset] == 0).all(axis=1)]

And it would work fine. But when I try to expand it out:

df[~(df[drop_subset] == 0 or df[drop_subset] == 'N').all(axis=1)]

I get the dreaded:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I originally thought to treat each of the inner conditions using the any or all operators... except that the logic doesn't work that way. I don't need to know if any or all of the values are 0, or separately whether any or all of the values are N. I need to know if all of the values are N or 0. That is, there's nothing I can do with this:

In [71]: (df[drop_subset] == 0)
Out[71]:
     one  three
0   True  False
1   True  False
2  False  False
3  False  False
4   True  False

without simultaneously testing for the 'N' value as well. And while it seems obvious on this small dataframe that I could test column one for 0 and column three for 'N' separately, in reality my drop_subset includes almost 100 columns, which differ across three different datasets, and without manually encoding all of them I don't know which columns would have 0 and which would have 'N'. All I know for certain is that if an observation is all 0s or 'N's across this subset, then I need to drop it.

My last resort is using apply with a lambda, but it would be painfully slow given the size of the data. Plus I feel like there must be a vectorized way to express this logic in Pandas, and that's what I'm looking for.
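
For reference, the apply fallback I mean would look roughly like this (just a sketch, reusing the df and drop_subset above; it loops over every row in Python, which is why I expect it to crawl on my data):

# True where every value in the subset columns is 0 or 'N' for that row
mask = df[drop_subset].apply(lambda row: all(v in (0, 'N') for v in row), axis=1)
df[~mask]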

Jeff

2 Answers


IIUC you can use the DataFrame.isin() method:

In [68]: df[df[drop_subset].isin([0,'N']).all(1)]
Out[68]:
   one three  two
0    0     N    0
4    0     N    0

In [69]: df[~df[drop_subset].isin([0,'N']).all(1)]
Out[69]:
   one three  two
1    0     Y    0
2    1     N    0
3    2     Y    0
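
To actually drop those rows you just keep the negated mask; spelled out with axis=1 and assigned back, a minimal sketch (reusing the OP's df and drop_subset) looks like this:

# True where every value in the subset columns is 0 or 'N'
all_zero_or_n = df[drop_subset].isin([0, 'N']).all(axis=1)
# Keep only the other rows (rows 1, 2 and 3 in the example)
cleaned = df[~all_zero_or_n]
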
MaxU - stand with Ukraine
  • That's the one, thanks. Although I have to say it's disappointing this can't be done with boolean operators. – Jeff Nov 15 '16 at 22:21
  • Well, your suggestion for `isin` is very clean and fast, so there's no real need. It just seems like something that *should* work within Pandas, which isn't really a problem you run into a lot. – Jeff Nov 15 '16 at 22:36

I think Max nailed it in his answer: isin is naturally what you should use.

In fact, it is what you must use, since df[drop_subset] == 'N' won't work anyway; you'll get:

TypeError: Could not compare ['N'] with block values

You need isin() for this job.

I specifically wanted to point out that the error you got, "truth value of [DataFrame/Series] is ambiguous", comes from the fact that you should always use & instead of and, and | instead of or.

The binary operators & and | perform element-wise comparisons and therefore return an array of booleans. Python's and and or instead try to reduce each side to a single boolean value, which doesn't exist for a DataFrame or Series, hence the error.
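
As a minimal illustration of the difference (on a toy Series rather than your actual frame):

import pandas as pd

s = pd.Series([0, 1, 2])

# Element-wise OR returns a boolean Series: [True, False, True]
mask = (s == 0) | (s == 2)

# Plain `or` asks each Series for a single truth value -> ValueError
# mask = (s == 0) or (s == 2)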

Read more here: Logic operator for boolean indexing in Pandas.

Julien Marrec
  • Yeah, I actually used the pipe first, but ran into the block values error. But apparently that's an unrelated issue. – Jeff Nov 15 '16 at 22:15