1

I have a time series data frame for eight Years (2013-2020) has Hourly data, each Year has Nine zones, under each zone two columns("Gen", "Load") as follows:

                            A ZONE         B ZONE               ...   G ZONE       H ZONE           I ZONE         
        date_time           GEN    LOAD    GEN    LOAD          ...    LOAD      GEN    LOAD         GEN    LOAD
    2013-01-01 00:00:00    725.7 5,859.5  312.2 3,194.7        ...    77.1      706.0 227.1        495.0   861.9
    2013-01-01 01:00:00    436.2 450.5    248.0 198.0          ...    865.5     240.7 107.9        640.5 767.3
    2013-01-01 02:00:00    464.5 160.2    144.2 068.3          ...    738.7     044.7 32.7         509.3 700.4
    2013-01-01 03:00:00    169.9 733.8    268.1 869.5          ...    671.7     649.4 951.3        626.8 652.1
    2013-01-01 04:00:00    145.4 553.4    280.2 872.8          ...    761.5     561.0 912.9        552.1 637.3
...                  ...        ...     ...        ...          ...  ...        ...         ...     ...         ...     ...
    2020-12-31 19:00:00    450.9 951.7    371.4 516.3          ...    461.7     808.9 471.4        983.7 447.8
    2020-12-31 20:00:00    553.0 936.5    848.7 233.9          ...    397.3     978.3 404.3        490.9 233.0
    2020-12-31 21:00:00    458.6 735.6    716.8 121.7          ...    385.1     808.0 192.0        131.5 70.1
    2020-12-31 22:00:00    515.8 651.6    693.5 142.4          ...    291.4     826.1 16.8         591.9 863.2
    2020-12-31 23:00:00    218.6 293.4    448.2 14.2           ...    340.6     435.0 897.4        622.5 768.3

What I want is the following:

1- Detect outliers in each column which is more or less three time Standard Deviation of that column and put it in a new column its name "A_gen_outliers" if the there is outliers in "GEN"column under "A Zone" as well as "A_load_outliers" if the there is outliers in "LOAD"column under "A Zone". Number of new columns are 18 columns.

2- A new column represents sum of "Gen" columns

3- A new column represents sum of "Load" columns

4- A new column represents "GEN" column calculate A_GEN_div = cell value/maximum value of "GEN column under A Zone for each year for example 725.7/725.7=1 for the first cell and 436.2/725.1 for second cell and for last cell 218.6/553. etc. and the same for all "GEN" columns and also for "LOAD" columns- proposed names "A_Load_div".

Number of new columns are 18 columns.

Number of total new columns are "18 *2 + 2" columns

Thanks in advance.

gofvonx
  • 1,370
  • 10
  • 20
swhh
  • 87
  • 1
  • 7

1 Answers1

1

I think this might help. Note that this will keep the columns MultiIndex. Your points above seem to imply that you want to flatten your MultIndex. If this is the case, you might want to look at this question.

1:

df.join(df>(3*df.std()), rsuffix='_outlier')

2 and 3:

df.groupby(level=-1, axis=1).sum()

Note that it is not clear from what the first level of the columns MultIndex should be for this.

4:

maxima = df.resample('1Y').max()
maxima.index = maxima.index + pd.DateOffset(hours=23)
maxima = maxima.reindex(df.index, method='bfill')
df.join(df.divide(maxima), rsuffix='_div')
gofvonx
  • 1,370
  • 10
  • 20
  • df.join(df.divide(df.max()), rsuffix='_div') provide the results based on the maximum value in all values the column but I want the maximum value per year to calculate cell value/max. value per year. as the example mention above. – swhh Apr 15 '21 at 15:47
  • @swhh Apologies - I have updated my answer. Does this work? – gofvonx Apr 15 '21 at 16:31
  • sorry for being late to - the last code worked well but if there is a way to remove the word zone of the columns name to compress their names – swhh Apr 15 '21 at 20:05
  • @swhh You can rename the columns with `pd.MultiIndex.set_levels`. See, e.g., this question: https://stackoverflow.com/questions/41221079/rename-multiindex-columns-in-pandas – gofvonx Apr 16 '21 at 08:19