I have a PySpark DataFrame that I group by one field (column) in order to eliminate, within each group, the records that have a certain value in another field. For instance, the table looks like this:
```
colA  colB
'a'   1
'b'   1
'a'   0
'c'   0
```
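For reproducibility, here is a minimal sketch that builds this DataFrame (assuming a local SparkSession):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data matching the table above
df = spark.createDataFrame(
    [("a", 1), ("b", 1), ("a", 0), ("c", 0)],
    ["colA", "colB"],
)
```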
What I'd like is to remove the records where colA is duplicated and colB is 0, so as to obtain:
```
colA  colB
'a'   1
'b'   1
'c'   0
```
The row for 'c' remains because I want to remove the 0s only for rows that are duplicated on colA.
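For this toy data alone, collapsing each group with max would give the right rows, but only because colB is 0/1, and it would drop any other columns I might need to keep, so it doesn't really generalize. A sketch:

```python
from pyspark.sql import functions as F

# Works here only because colB is binary: max(colB) per colA group
# keeps the 1 when a duplicate exists, and the 0 for singleton groups
collapsed = df.groupBy("colA").agg(F.max("colB").alias("colB"))
```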
I can't think of a way to achieve this, because I'm not proficient with using `agg` after a `groupBy` when the expression is not one of the built-in aggregates such as "avg" or "max".
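I suspect a window function could do it; here is a sketch (assuming that counting rows per colA value is an acceptable way to detect the duplicates):

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Count rows per colA value, then drop rows that are both
# duplicated (count > 1) and have colB == 0
w = Window.partitionBy("colA")
result = (
    df.withColumn("cnt", F.count("*").over(w))
      .filter(~((F.col("cnt") > 1) & (F.col("colB") == 0)))
      .drop("cnt")
)
```

But I'd still like to understand whether (and how) this can be expressed with `groupBy` and `agg` instead.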