0

I hope find there are experts who can help)

There is such a table

X2         X3   X4  Y   Y1
01.02.2019  1   1   1   
02.02.2019  2   2   0   
02.02.2019  2   3   0   
02.02.2019  2   1   1   
03.02.2019  1   2   1   
04.02.2019  2   3   0   
05.02.2019  1   1   1   
06.02.2019  2   2   0   
07.02.2019  1   3   1   
08.02.2019  2   1   1   
09.02.2019  1   2   0   
10.02.2019  2   3   1   
11.02.2019  1   1   0   
12.02.2019  2   2   1   
13.02.2019  1   3   0   
14.02.2019  2   1   1   
15.02.2019  1   2   1   
16.02.2019  2   3   0   
17.02.2019  1   1   1   
18.02.2019  2   2   0

And in column Y1 it is necessary to calculate the moving average of column Y for the last 5 days, but only with filtering by condition X3 and X4. The filter is equal to the current value of the columns for the current row. For example, for the string 02/04/2019 2 3 0 the average will be equal to 0, because for it only the string matches the condition 02.02.2019 2 3 0

How to do this I do not understand, I know that it will be something like

filtered_X4 = df ['X4']. where (condition_1 & condition_2 & condition_3)

But how to set the conditions themselves condition_1,2,3 I do not understand.

Saw many examples when the filter is known, for example condition_1 = df ['X2']. isin ([2, 3, 5]) but that's not what i need, because my condition values change with the string

How to calculate the mean I know

df ['Y1'] = filtered_X4.shift (1) .rolling (window = 999999, min_periods = 1) .mean ()

but can't configure filtering.

add1: This is the result I'm trying to get:

X2          X3  X4  Y   Y1
01.02.2019  1   1   1   NAN
02.02.2019  2   2   0   NAN
02.02.2019  2   3   0   NAN
02.02.2019  2   1   1   NAN
03.02.2019  1   2   1   NAN
04.02.2019  2   3   0   0
05.02.2019  1   1   1   1
06.02.2019  2   2   0   0
07.02.2019  1   3   1   NAN
08.02.2019  2   1   1   NAN
09.02.2019  1   2   0   NAN
10.02.2019  2   3   1   NAN
11.02.2019  1   3   0   1
12.02.2019  2   2   1   NAN
13.02.2019  1   3   0   0
14.02.2019  2   1   1   NAN
15.02.2019  2   2   1   1
16.02.2019  2   3   0   NAN
17.02.2019  1   1   1   NAN
18.02.2019  2   2   0   1

For example, to calculate the average (Y1) of this line:

    X2          X3  X4  Y   Y1
    04.02.2019  2   3   0   

I need to take only the strings from the dateframe with X3 = 2 and X4 = 3 and X2 from 30.01.2019 to 03.02.2019

asymon
  • 187
  • 9
  • 2
    Welcome to stack overflow, this is an interesting question, but I can't quite figure out what you are doing. Could you include the output you are looking for on the sample dataframe in your question? This would help a lot. – run-out May 19 '19 at 05:35
  • added information – asymon May 19 '19 at 07:55

1 Answers1

1

To do this, use .apply()

Convert date to datetime.

df['X2'] = pd.to_datetime(df['X2'], format='%d.%m.%Y')

print(df)

           X2 X3 X4  Y
0  2019-02-01  1  1  1
1  2019-02-02  2  2  0
2  2019-02-02  2  3  0
3  2019-02-02  2  1  1
4  2019-02-03  1  2  1
5  2019-02-04  2  3  0
6  2019-02-05  1  1  1
7  2019-02-06  2  2  0
8  2019-02-07  1  3  1
9  2019-02-08  2  1  1
10 2019-02-09  1  2  0
11 2019-02-10  2  3  1
12 2019-02-11  1  3  0
13 2019-02-12  2  2  1
14 2019-02-13  1  3  0
15 2019-02-14  2  1  1
16 2019-02-15  2  2  1
17 2019-02-16  2  3  0
18 2019-02-17  1  1  1
19 2019-02-18  2  2  0

Using apply and lambda, create a df.loc filter for each row, restricting by date to the previous 5 days, and also for equality in columns X2 and X3, then calculate the mean of 'Y'.

df['Y1'] = df.apply(
    lambda x: df.loc[
        (
            (df.X2 < x.X2)
            & (df.X2 >= (x.X2 + pd.DateOffset(days=-4)))
            & (df.X3 == x.X3)
            & (df.X4 == x.X4)
        ),
        "Y",
    ].mean(),
    axis=1,
)


print(df)

           X2 X3 X4  Y   Y1
0  2019-02-01  1  1  1  NaN
1  2019-02-02  2  2  0  NaN
2  2019-02-02  2  3  0  NaN
3  2019-02-02  2  1  1  NaN
4  2019-02-03  1  2  1  NaN
5  2019-02-04  2  3  0  0.0
6  2019-02-05  1  1  1  1.0
7  2019-02-06  2  2  0  0.0
8  2019-02-07  1  3  1  NaN
9  2019-02-08  2  1  1  NaN
10 2019-02-09  1  2  0  NaN
11 2019-02-10  2  3  1  NaN
12 2019-02-11  1  3  0  1.0
13 2019-02-12  2  2  1  NaN
14 2019-02-13  1  3  0  0.0
15 2019-02-14  2  1  1  NaN
16 2019-02-15  2  2  1  1.0
17 2019-02-16  2  3  0  NaN
18 2019-02-17  1  1  1  NaN
19 2019-02-18  2  2  0  1.0

Y1 result is in dtype float since np.NaN is not compatible with integer series. If you need integers, use the following workaround.

col = 'Y1'

​

df[col] = df[col].fillna(-1)

df[col] = df[col].astype(int)

df[col] = df[col].astype(str)

df[col] = df[col].replace('-1', np.nan)

​

print(df)

           X2 X3 X4  Y   Y1
0  2019-02-01  1  1  1  NaN
1  2019-02-02  2  2  0  NaN
2  2019-02-02  2  3  0  NaN
3  2019-02-02  2  1  1  NaN
4  2019-02-03  1  2  1  NaN
5  2019-02-04  2  3  0    0
6  2019-02-05  1  1  1    1
7  2019-02-06  2  2  0    0
8  2019-02-07  1  3  1  NaN
9  2019-02-08  2  1  1  NaN
10 2019-02-09  1  2  0  NaN
11 2019-02-10  2  3  1  NaN
12 2019-02-11  1  3  0    1
13 2019-02-12  2  2  1  NaN
14 2019-02-13  1  3  0    0
15 2019-02-14  2  1  1  NaN
16 2019-02-15  2  2  1    1
17 2019-02-16  2  3  0  NaN
18 2019-02-17  1  1  1  NaN
19 2019-02-18  2  2  0    1

EDIT

Follow up question, how to apply the above daily with new data and not including old data:

You just need to filter your data to the data range you want to include.

Create a startdate in datetime

startdate = pd.to_datetime('2019-02-13')

Modify the apply function adding in an if condition:

df['Y1'] = df.apply(
    lambda x: (df.loc[
        (
            (df.X2 < x.X2)
            & (df.X2 >= (x.X2 + pd.DateOffset(days=-4)))
            & (df.X3 == x.X3)
            & (df.X4 == x.X4)
        ),
        "Y",
    ].mean()) if x[0] >= startdate else x[3]
    , axis=1
)

**This will only work after the first time you run the apply statement, otherwise you will get an out of index error. **

So run it first without the if condition then thereafter run with the if conditiion.

run-out
  • 3,114
  • 1
  • 9
  • 25
  • 1
    Thank you so much for the answer! 1 more question. How can I do this operation for line 20 without repeating it for the remaining lines? For example, today I did for 19 lines, tomorrow I added 20, how to make an operation with it? – asymon May 19 '19 at 14:43
  • 1
    Thank! that's what I need. :) – asymon May 19 '19 at 18:16
  • One more question. If I have another block, with the same conditions for filtering, but another variable for calculating the average (for example "Z" and "Z1"). How can they be combined into one function? The calculation of several moving averages separately takes a very long time. – asymon Jun 08 '19 at 13:14