
I want to filter out columns in a dataframe where there is little to no change in the data throughout. An example plot of one of the columns is shown below:

[example plot of a near-constant column]

What I'm doing currently is quite simple and probably very inefficient.

from collections import Counter

n = data2.shape[0]  # total number of rows

# For each column, count how often its single most common value occurs
# and print that count as a fraction of all rows.
for col in data2.columns:
    most_freq = Counter(data2[col]).most_common(1)[0][1]
    print(col, most_freq / n)

Its output:

m0 0.25192519251925194
m1 0.5808580858085809
m2 0.09790979097909791
m3 0.0033003300330033004
m4 0.9713971397139713
m5 1.0
m6 1.0
m7 1.0
m8 1.0
m9 0.9713971397139713
m10 1.0
m11 1.0

As you can see, I'd like to filter out the columns (like m5, m6, etc.) that consist almost entirely of a single constant value. Is there a better, perhaps statistical, way to do it? I've looked at a similar question, but it didn't help much.
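For completeness, the same ratios can be computed without the explicit Python loop; this is just a sketch of an equivalent vectorized version, assuming data2 is a pandas DataFrame:

# Modal share per column: value_counts sorts descending, so the first
# normalized count is the fraction taken by the most common value.
ratios = data2.apply(lambda col: col.value_counts(normalize=True).iloc[0])
print(ratios)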

Update:

Based on @Kosmos's answer, this seemed to work well for me. At least, it helped me remove the obvious ones with flat lines.

# keep only the columns whose variance rounds to a nonzero value
data2 = data2.loc[:, (round(data2.var()) > 0)]
Apoorv Patne

1 Answer


Feature selection via a variance threshold, e.g. scikit-learn's VarianceThreshold.

Variance is a great statistic to use if you want information on the variability of a feature: a column with (near-)zero variance is essentially constant and carries no information.
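A minimal sketch using scikit-learn's VarianceThreshold (the threshold value here is an illustrative assumption; the default of 0.0 removes only exactly-constant features):

from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance does not exceed the threshold.
selector = VarianceThreshold(threshold=0.0)
reduced = selector.fit_transform(data2)

# get_support() returns a boolean mask of the columns that were kept.
kept_cols = data2.columns[selector.get_support()]
print(kept_cols)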

Edit: I think a cleaner solution would be to simply use DataFrame.var() and filter based on that.
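For instance (a sketch; the cutoff is an arbitrary assumption, not a recommended value):

# Keep only the columns whose variance exceeds a small cutoff.
threshold = 1e-8
data2 = data2.loc[:, data2.var() > threshold]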

Kosmos