I have a h3_15min with 3454 rows
Timestamp KE EH LA PR AG
0 2013-02-27 00:00:00 1.000000 2.0 0.03 201.289993 4.36
1 2013-02-27 00:15:00 0.990000 2.0 0.03 210.070007 4.38
2 2013-02-27 00:30:00 0.950000 2.0 0.02 207.779999 4.35
3 2013-02-27 00:45:00 0.990000 2.0 0.03 151.960007 4.34
4 2013-02-27 01:00:00 341.209991 2.0 0.04 0.000000 4.41
... ... ... ... ... ... ...
3449 2013-04-03 22:15:00 NaN 2.0 0.03 0.000000 NaN
3450 2013-04-03 22:30:00 NaN NaN 0.07 0.000000 NaN
3451 2013-04-03 22:45:00 NaN NaN NaN 0.000000 NaN
3452 2013-04-03 23:00:00 NaN NaN NaN 0.000000 NaN
3453 2013-04-03 23:15:00 NaN NaN NaN 0.000000 NaN
Which is described as this
KE EH LA PR AG
count 3439.000000 3450.000000 3451.000000 3454.000000 3416.000000
mean 7.361526 60.447796 20.266174 17.185938 506.416779
std 48.624306 286.459686 168.753860 59.658848 623.306396
min 0.000000 2.000000 0.000000 0.000000 4.010000
25% 0.970000 2.000000 0.020000 0.000000 170.047501
50% 0.990000 2.000000 0.040000 0.000000 245.834991
75% 0.990000 2.000000 0.080000 0.000000 526.140015
max 652.210022 2199.290039 2214.550049 278.029999 3543.469971
I would like to remove the outliers so that I can calculate the mean and replace the NaN values.
I tried to use the code below derived from [this][1] post:
h3_15min[(np.abs(stats.zscore(h3_15min.loc[:, h3_15min.columns != "Timestamp" ])) < 3)]
Which results in h3_15min with 3272 rows"
Timestamp KE EH LA PR AG
3 2013-02-27 00:45:00 0.990000 2.0 0.03 151.960007 4.34
4 2013-02-27 01:00:00 341.209991 2.0 0.04 0.000000 4.41
5 2013-02-27 01:15:00 1.000000 2.0 0.02 0.000000 4.29
6 2013-02-27 01:30:00 0.990000 2.0 0.04 0.000000 4.19
7 2013-02-27 01:45:00 0.990000 2.0 0.01 0.000000 4.15
... ... ... ... ... ... ...
3449 2013-04-03 22:15:00 NaN 2.0 0.03 0.000000 NaN
3450 2013-04-03 22:30:00 NaN NaN 0.07 0.000000 NaN
3451 2013-04-03 22:45:00 NaN NaN NaN 0.000000 NaN
3452 2013-04-03 23:00:00 NaN NaN NaN 0.000000 NaN
3453 2013-04-03 23:15:00 NaN NaN NaN 0.000000 NaN
It seems like it didn't remove the max outliers but just some random rows.
The outliers are > 1 for KE, > 2 for EH > 1 for LA and > 300 for PR. Any idea on how I can remove the outliers for the dataframe without having to type it manually for every column? My other dataset has 50 columns it would be great if it could be done automatically. [1]: Detect and exclude outliers in Pandas data frame