0

I have a h3_15min with 3454 rows

     Timestamp              KE          EH   LA     PR          AG
0    2013-02-27 00:00:00    1.000000    2.0  0.03   201.289993  4.36
1    2013-02-27 00:15:00    0.990000    2.0  0.03   210.070007  4.38
2    2013-02-27 00:30:00    0.950000    2.0  0.02   207.779999  4.35
3    2013-02-27 00:45:00    0.990000    2.0  0.03   151.960007  4.34
4    2013-02-27 01:00:00    341.209991  2.0  0.04   0.000000    4.41
... ... ... ... ... ... ...
3449 2013-04-03 22:15:00    NaN         2.0  0.03   0.000000    NaN
3450 2013-04-03 22:30:00    NaN         NaN  0.07   0.000000    NaN
3451 2013-04-03 22:45:00    NaN         NaN  NaN    0.000000    NaN
3452 2013-04-03 23:00:00    NaN         NaN  NaN    0.000000    NaN
3453 2013-04-03 23:15:00    NaN         NaN  NaN    0.000000    NaN

Which is described as this

        KE          EH          LA          PR          AG
count   3439.000000 3450.000000 3451.000000 3454.000000 3416.000000
mean    7.361526    60.447796   20.266174   17.185938   506.416779
std     48.624306   286.459686  168.753860  59.658848   623.306396
min     0.000000    2.000000    0.000000    0.000000    4.010000
25%     0.970000    2.000000    0.020000    0.000000    170.047501
50%     0.990000    2.000000    0.040000    0.000000    245.834991
75%     0.990000    2.000000    0.080000    0.000000    526.140015
max     652.210022  2199.290039 2214.550049 278.029999  3543.469971

I would like to remove the outliers so that I can calculate the mean and replace the NaN values.

I tried to use the code below derived from [this][1] post:

h3_15min[(np.abs(stats.zscore(h3_15min.loc[:, h3_15min.columns != "Timestamp" ])) < 3)]

Which results in h3_15min with 3272 rows"

        Timestamp           KE          EH  LA      PR          AG
3       2013-02-27 00:45:00 0.990000    2.0 0.03    151.960007  4.34
4       2013-02-27 01:00:00 341.209991  2.0 0.04    0.000000    4.41
5       2013-02-27 01:15:00 1.000000    2.0 0.02    0.000000    4.29
6       2013-02-27 01:30:00 0.990000    2.0 0.04    0.000000    4.19
7       2013-02-27 01:45:00 0.990000    2.0 0.01    0.000000    4.15
... ... ... ... ... ... ...
3449    2013-04-03 22:15:00 NaN        2.0  0.03    0.000000    NaN
3450    2013-04-03 22:30:00 NaN        NaN  0.07    0.000000    NaN
3451    2013-04-03 22:45:00 NaN        NaN  NaN     0.000000    NaN
3452    2013-04-03 23:00:00 NaN        NaN  NaN     0.000000    NaN
3453    2013-04-03 23:15:00 NaN        NaN  NaN     0.000000    NaN

It seems like it didn't remove the max outliers but just some random rows.

The outliers are > 1 for KE, > 2 for EH > 1 for LA and > 300 for PR. Any idea on how I can remove the outliers for the dataframe without having to type it manually for every column? My other dataset has 50 columns it would be great if it could be done automatically. [1]: Detect and exclude outliers in Pandas data frame

MxGr20
  • 77
  • 6

1 Answers1

0
outliers = (h3_15min.KE > 1) & (h3_15min.EH > 2) & (h3_15min.LA > 1) & (h3_15min.PR > 300)
no_outliers = h3_15min.loc[~outliers]

should do the trick.

arstep
  • 1
  • 1
  • thank you for the suggestion. Unfortunately, when I ran the code the outliers are still there. – MxGr20 Mar 25 '21 at 13:45