
I have the following data frame.

    A1  A2  A3  B1  B2  B3  C1  C2  C3
0   0   0   1   1   1   1   0   1   1
1   0   0   0   0   0   0   0   0   0
2   1   1   1   0   1   1   1   1   1

I am looking to filter it based on groups of columns and the occurrence of non-zero values. I wrote the following to achieve that.

import pandas as pd
df = pd.read_csv("TEST_TABLE.txt", sep='\t')
print(df)
group1 = ['A1','A2','A3']
group2 = ['B1','B2','B3']
group3 = ['C1','C2','C3']
df2 = df[(df[group1] != 0).any(axis=1) & (df[group2] != 0).any(axis=1) & (df[group3] != 0).any(axis=1)]
print(df2)

The output was perfect:

    A1  A2  A3  B1  B2  B3  C1  C2  C3
0   0   0   1   1   1   1   0   1   1
2   1   1   1   0   1   1   1   1   1

Now, how do I modify the code so that I can impose a threshold value for "any", i.e. retain only rows where each group has at least 2 non-zero values? The final output would then be

   A1  A2  A3  B1  B2  B3  C1  C2  C3
2   1   1   1   0   1   1   1   1   1

Thanks in advance.

Arun

1 Answer


You can create the boolean masks in a loop, using `sum` to count the non-zero values in each group, comparing with `ge` (>=), and finally reducing the masks with `np.logical_and.reduce`:

import numpy as np

groups = [group1, group2, group3]
# keep rows where every group has at least 2 non-zero values
df2 = df[np.logical_and.reduce([(df[g] != 0).sum(axis=1).ge(2) for g in groups])]

print(df2)
   A1  A2  A3  B1  B2  B3  C1  C2  C3
2   1   1   1   0   1   1   1   1   1
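
If the threshold needs to change later, it can be pulled out into a variable (a small variation on the code above, not part of the original answer):

threshold = 2  # minimum number of non-zero values required per group
df2 = df[np.logical_and.reduce([(df[g] != 0).sum(axis=1).ge(threshold) for g in groups])]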

Detail:

print([(df[g]!=0).sum(axis=1).ge(2) for g in groups])

[0    False
1    False
2     True
dtype: bool, 0     True
1    False
2     True
dtype: bool, 0     True
1    False
2     True
dtype: bool]
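
An equivalent pandas-only variant (sketched here as an alternative, not part of the original answer) concatenates the per-group masks into one DataFrame and requires all of them per row:

# one boolean column per group, then require all three to hold for a row
masks = pd.concat([(df[g] != 0).sum(axis=1).ge(2) for g in groups], axis=1)
df2 = df[masks.all(axis=1)]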
  • I tried replacing `sum(axis=1)` with `count(axis=1)`, which gives a table the same as `df`. Does this mean `count` will always be true because the number of columns in each group is 3? – Arun Sep 06 '18 at 08:47
  • 1
    @Arun - [`count`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html) function is used for get count of non NaNs values, here is used `sum` for count Trues values, which are processes like `1`s – jezrael Sep 06 '18 at 10:25
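
For illustration, the two behave differently on the first group with the sample data above:

print(df[group1].count(axis=1))       # non-NaN values per row -> 3, 3, 3
print((df[group1] != 0).sum(axis=1))  # non-zero values per row -> 1, 0, 3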