Suppose I have a DataFrame like the following:
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                   'cat': ['C', 'D', 'D', 'C', 'D', 'D', 'D', 'C'],
                   'num': [1, 2, 2, 1, 2, 2, 2, 1],
                   'cat2': ['X', 'Y', 'Y', 'X', 'Y', 'Y', 'Y', 'X']})
giving:
  val cat  num cat2
0   a   C    1    X
1   b   D    2    Y
2   c   D    2    Y
3   d   C    1    X
4   e   D    2    Y
5   f   D    2    Y
6   g   D    2    Y
7   h   C    1    X
You'll notice that the columns num and cat2 are redundant, because the values in cat, num and cat2 always match across the columns: C == 1 == X and D == 2 == Y.
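To make "always match" concrete, here is a minimal sketch of how a single pair of columns could be tested for a one-to-one correspondence. The helper name `is_one_to_one` is hypothetical (not part of pandas), and the sketch assumes the columns contain no missing values.

```python
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                   'cat': ['C', 'D', 'D', 'C', 'D', 'D', 'D', 'C'],
                   'num': [1, 2, 2, 1, 2, 2, 2, 1],
                   'cat2': ['X', 'Y', 'Y', 'X', 'Y', 'Y', 'Y', 'X']})


def is_one_to_one(frame, a, b):
    """Hypothetical helper: True if columns a and b map onto each other 1:1."""
    # Collect the distinct (a, b) value pairs that occur in the frame.
    pairs = frame[[a, b]].drop_duplicates()
    # The mapping is one-to-one only if no value of a is paired with two
    # different values of b, and vice versa.
    return pairs[a].is_unique and pairs[b].is_unique


print(is_one_to_one(df, 'cat', 'num'))   # C <-> 1, D <-> 2
print(is_one_to_one(df, 'cat', 'val'))   # C pairs with several vals
```

Running this for every pair of columns is exactly the nested-loop approach I'd like to avoid, but it pins down what "redundant" means here.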
I'd like to identify the redundant columns so that I can discard them and keep just one representation, like below. Keeping num or cat2 instead of cat would be fine too.
  val cat
0   a   C
1   b   D
2   c   D
3   d   C
4   e   D
5   f   D
6   g   D
7   h   C
I can't think of a solution that doesn't involve nested loops comparing every pair of columns, which gets more and more expensive as columns are added, and I suspect there might be a clever way to address it. Other questions I've seen about redundant data usually deal with columns whose values are literally equal, not with columns that are equivalent under a relabeling.
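One direction that might avoid the pairwise comparison, offered as a sketch rather than a settled solution: relabel each column's values by order of first appearance with `pd.factorize`, so that columns that are equivalent under a relabeling end up with identical integer codes, then drop the duplicated code columns. The variable names `codes`, `keep` and `deduped` are my own, and the sketch assumes there are no missing values (`pd.factorize` gives NaN the code -1 by default).

```python
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
                   'cat': ['C', 'D', 'D', 'C', 'D', 'D', 'D', 'C'],
                   'num': [1, 2, 2, 1, 2, 2, 2, 1],
                   'cat2': ['X', 'Y', 'Y', 'X', 'Y', 'Y', 'Y', 'X']})

# Replace each column's values with integer codes assigned in order of
# first appearance, e.g. cat -> [0, 1, 1, 0, ...] and num -> [0, 1, 1, 0, ...].
codes = pd.DataFrame({col: pd.factorize(df[col])[0] for col in df.columns},
                     index=df.index)

# After transposing, each original column is a row; duplicated() marks every
# row that repeats an earlier one, so negating it keeps first occurrences.
keep = ~codes.T.duplicated()

deduped = df.loc[:, keep.to_numpy()]
print(deduped.columns.tolist())  # ['val', 'cat']
```

This makes one pass over the columns to build the codes and leaves the duplicate detection to pandas. The first column of each equivalent group is the one that survives, so reordering the columns changes which representation is kept.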
Thanks!