I have the following dataset:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10 ...
Col991 Col992 Col993 Col994 Col995 Col996 Col997 Col998 Col999 Col1000
rows
Row1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row2 0 0 0 0 0 23 0 0 0 0 ... 0 0 0 0 7 0 0 0 0 0
Row3 97 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Row496 182 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 116 0 0 0
Row497 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row498 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row499 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Row500 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 125 0 0 0
I am trying to remow columns where total number of nonzeros entries is less than 1% of the number of rows.
I can calculate the percentage of nonzeros entries columnwise
(df[df > 0.0].count()/df.shape[0])*100
How should I use this to get df
with those columns where number of columns have nonzeros in more than 1% of the rows only? Further, how should I change code to remove rows where nonzeros is less than 1% of columns?