I have several datasets where I need to discard observations if all the values in a subset of columns are equal to 0 or N (depending on the column). For example:
df = pd.DataFrame({'one':[0,0,1,2,0], 'two':[0,0,0,0,0], 'three':['N','Y','N','Y','N']})
drop_subset = ['one', 'three']
In [4]: df
Out[4]:
one three two
0 0 N 0
1 0 Y 0
2 1 N 0
3 2 Y 0
4 0 N 0
I need to look at just columns one
and three
, then drop rows 0 and 4 only. If it were just a single value rather than two or more, I would use this:
df[~(df[drop_subset] == 0).all(axis=1)]
And it would work fine. But when I try to expand it out:
df[~(df[drop_subset] == 0 or df[drop_subset] == 'N').all(axis=1)]
I get the dreaded:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(),a.item(), a.any() or a.all().
I originally thought to treat each of the inner conditions using the any
or all
operators... except that the logic doesn't work that way. I don't need to know if any or all of the values are 0, or separately whether any or all of the values are N. I need to know if all of the values are N or 0. That is, there's nothing I can do with this:
In [71]: (df[drop_subset] == 0)
Out[71]:
one three
0 True False
1 True False
2 False False
3 False False
4 True False
without simultaneously testing for the 'N' value also. And while it seems obvious on this small dataframe that I could test column one
for 0 and column three
for 'N' separately, in reality my drop_subset
includes almost 100 columns, which differ across three different datasets, and without manually encoding all of them I don't know which columns would have 0 and which would have 'N'. All I know for certain is that if an observation has all 0 or 'N' across this subset then I need to drop it.
My last resort is using apply with a lambda, but it would be painfully slow due to the size of the data. Plus I feel like there must be a vectorized way to express this logic in Panadas, and that's what I'm looking for.