Selecting rows based on value counts of TWO columns

Question

My question is similar to Pandas: Selecting rows based on value counts of a particular column but with TWO columns:

This is a very small snippet from the dataframe (The main df contains millions of entries):

            overall vote  verified   reviewTime      reviewerID    productID  
4677505      5.0  NaN      True  11 28, 2017   A2O8EJJBFJ9F1  B00NR2VMNC   
1302483      5.0  NaN      True   04 1, 2017  A1YMYW7EWN4RL3  B001S2PPT0   
5073908      3.0   83      True  02 12, 2016  A3H796UY7GIX0K  B00ULRFQ1A   
200512       5.0  NaN      True  07 14, 2016  A150W68P8PYXZE  B0000DC0T3   
1529831      5.0  NaN      True  12 19, 2013  A28GVVNJUZ3VFA  B002WE3BZ8   
1141922      5.0  NaN     False  12 20, 2008  A2UOHALGF2X77Q  B001CCLBSA   
5930187      3.0    2      True  05 21, 2018  A2CUSR21CZQ6J7  B01DCDG9JC   
1863730      5.0  NaN      True   05 6, 2017  A38A3VQL8RLS8D  B004HKIB6E   
1835030      5.0  NaN      True  06 20, 2016  A30QT3MWWEPNIE  B004D09HRK   
4226935      5.0  NaN      True  12 27, 2015  A3UORFPF49N96B  B00JP12170

Now I want to filter the dataframe so that each reviewerID and productID appears at least k times (lets say k=2) in the final filtered dataframe. In other words: That each user and product has at least k distinct entries/rows.

I would greatly appreciate any help.

score 0 · Accepted Answer · answered Aug 12 '21 at 14:01

Try this way

k=2
df = pd.read_csv('text.csv')
df['count']=1
df_group = df[['reviewerID','productID','count']].groupby(['reviewerID','productID'],as_index=False).sum()
df_group = df_group[df_group['count']>=k]
df_group.drop(['count'],axis=1,inplace=True)
df.drop(['count'],axis=1,inplace=True)
df = df.merge(df_group,on=['reviewerID','productID'])
df

Hope so it will help

Selecting rows based on value counts of TWO columns

1 Answers1