My data
is a bit complicated, I separate into 2 sections: (A) Explain data, (B) Desire output
(A) - Explain data:
My data
as follow:
comp date adj_date val
0 a 1999-12-31 NaT 50
1 a 2000-01-31 NaT 51
2 a 2000-02-29 NaT 52
3 a 2000-03-31 NaT 53
4 a 2000-04-30 NaT 54
5 a 2000-05-31 NaT 55
6 a 2000-06-30 NaT 56
----------------------------------
7 a 2000-07-31 2000-01-31 57
8 a 2000-08-31 2000-02-29 58
9 a 2000-09-30 2000-03-31 59
10 a 2000-10-31 2000-04-30 60
11 a 2000-11-30 2000-05-31 61
12 a 2000-12-31 2000-06-30 62
13 a 2001-01-31 2000-07-31 63
14 a 2001-02-28 2000-08-31 64
15 a 2001-03-31 2000-09-30 65
16 a 2001-04-30 2000-10-31 66
17 a 2001-05-31 2000-11-30 67
18 a 2001-06-30 2000-12-31 68
----------------------------------
19 a 2001-07-31 2001-01-31 69
20 a 2001-08-31 2001-02-28 70
21 a 2001-09-30 2001-03-31 71
22 a 2001-10-31 2001-04-30 72
23 a 2001-11-30 2001-05-31 73
24 a 2001-12-31 2001-06-30 74
25 a 2002-01-31 2001-07-31 75
26 a 2002-02-28 2001-08-31 76
27 a 2002-03-31 2001-09-30 77
28 a 2002-04-30 2001-10-31 78
29 a 2002-05-31 2001-11-30 79
30 a 2002-06-30 2001-12-31 80
----------------------------------
31 a 2002-07-31 2002-01-31 81
32 a 2002-08-31 2002-02-28 82
33 a 2002-09-30 2002-03-31 83
34 a 2002-10-31 2002-04-30 84
35 a 2002-11-30 2002-05-31 85
36 a 2002-12-31 2002-06-30 86
37 a 2003-01-31 2002-07-31 87
38 a 2003-02-28 2002-08-31 88
39 a 2003-03-31 2002-09-30 89
40 a 2003-04-30 2002-10-31 90
41 a 2003-05-31 2002-11-30 91
42 a 2003-06-30 2002-12-31 92
----------------------------------
date
: is the actual date, as end of month.adj_date
=date
+MonthEnd(-6)
val
: is given value
I want to create new column val_new
where:
- it is referencing to
val
of previous year December val_new
is then applied todate
as fromdate.July
todate.(year+1).June
, Or equivalently inadj_date
it is fromadj_date.Jan
toadj_date.Dec
(B) - Desire Output:
comp date adj_date val val_new
0 a 1999-12-31 NaT 50 NaN
1 a 2000-01-31 NaT 51 NaN
2 a 2000-02-29 NaT 52 NaN
3 a 2000-03-31 NaT 53 NaN
4 a 2000-04-30 NaT 54 NaN
5 a 2000-05-31 NaT 55 NaN
6 a 2000-06-30 NaT 56 NaN
-------------------------------------------
7 a 2000-07-31 2000-01-31 57 50.0
8 a 2000-08-31 2000-02-29 58 50.0
9 a 2000-09-30 2000-03-31 59 50.0
10 a 2000-10-31 2000-04-30 60 50.0
11 a 2000-11-30 2000-05-31 61 50.0
12 a 2000-12-31 2000-06-30 62 50.0
13 a 2001-01-31 2000-07-31 63 50.0
14 a 2001-02-28 2000-08-31 64 50.0
15 a 2001-03-31 2000-09-30 65 50.0
16 a 2001-04-30 2000-10-31 66 50.0
17 a 2001-05-31 2000-11-30 67 50.0
18 a 2001-06-30 2000-12-31 68 50.0
-------------------------------------------
19 a 2001-07-31 2001-01-31 69 62.0
20 a 2001-08-31 2001-02-28 70 62.0
21 a 2001-09-30 2001-03-31 71 62.0
22 a 2001-10-31 2001-04-30 72 62.0
23 a 2001-11-30 2001-05-31 73 62.0
24 a 2001-12-31 2001-06-30 74 62.0
25 a 2002-01-31 2001-07-31 75 62.0
26 a 2002-02-28 2001-08-31 76 62.0
27 a 2002-03-31 2001-09-30 77 62.0
28 a 2002-04-30 2001-10-31 78 62.0
29 a 2002-05-31 2001-11-30 79 62.0
30 a 2002-06-30 2001-12-31 80 62.0
-------------------------------------------
31 a 2002-07-31 2002-01-31 81 74.0
32 a 2002-08-31 2002-02-28 82 74.0
33 a 2002-09-30 2002-03-31 83 74.0
34 a 2002-10-31 2002-04-30 84 74.0
35 a 2002-11-30 2002-05-31 85 74.0
36 a 2002-12-31 2002-06-30 86 74.0
37 a 2003-01-31 2002-07-31 87 74.0
38 a 2003-02-28 2002-08-31 88 74.0
39 a 2003-03-31 2002-09-30 89 74.0
40 a 2003-04-30 2002-10-31 90 74.0
41 a 2003-05-31 2002-11-30 91 74.0
42 a 2003-06-30 2002-12-31 92 74.0
-------------------------------------------
I have two solutions, but both comes at a cost:
- Solution 1: to create
sub_dec
dataframe where we takeval
ofDec
each year. Then merge back to maindata
. This one works fine, but I don't like this solution because our actual data will involve a lot of merge, and it is not easy and convenient to keep track of all those merges. - Solution 2: (1) I create a lag by shift(7), (2) set other
adj_date
but Jan toNone
, (3) then use groupby with ffill. This solution works nicely, but if there is any missing rows, or the date is not continuous, then the entireoutput
is wrong
create adj_year
:
data['adj_year'] = data['adj_date'].dt.year
cross referencing by shift(7)
:
data['val_new'] = data.groupby('comp')['val'].shift(7)
setting other adj_date
except Jan to be None
:
data.loc[data['adj_date'].dt.month != 1, 'val_new'] = None
using ffill to fill in None by each group of ['comp', 'adj_year']
:
data['val_new'] = data.groupby(['comp', 'adj_year'])['val_new'].ffill()
If you have any suggestion to overcome the drawback of Solution 02, or any other new solution is appreciated.
Thank you