The idea is to convert the non-datetimelike columns to a MultiIndex with DataFrame.set_index and then convert the remaining columns to datetimes:
import pandas as pd

df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
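Because labels like 201801 have no day part, to_datetime parses them to the first day of each month, which is what the MonthEnd shift below relies on. A quick illustration with made-up labels:
print (pd.to_datetime(['201801', '201906'], format='%Y%m'))
DatetimeIndex(['2018-01-01', '2019-06-01'], dtype='datetime64[ns]', freq=None)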
Then create the new columns with DataFrame.assign. For date_begin, filter the January columns, test them for non-missing values and take the first matching column label per row with DataFrame.idxmax, then convert to the output format with Series.dt.strftime. For date_end, first reverse the column order with ::-1 indexing and get the last non-missing column the same way, shift it to the last day of that month with offsets.MonthEnd, and finally substitute the default 99991231 where the last column is not missing, using Series.where:
# first January column with a value -> period start
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
# last column with a value, shifted to the end of that month
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
# keep the computed end only where the last month is missing, else the default
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')

# restore the original column labels and attach the new columns
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 20190101 99991231
1 24.0 20180101 99991231
2 25.0 20180101 99991231
3 NaN 20180101 20181231
4 NaN 20180101 20181130
5 NaN 20180101 20190228
[6 rows x 22 columns]
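To show how the idxmax / MonthEnd / where steps work in isolation, here is a minimal sketch on a small made-up frame (the column dates and values are illustrative only, not from the question's data):
tmp = pd.DataFrame({pd.Timestamp('2018-01-01'): [1.0, None],
                    pd.Timestamp('2018-02-01'): [2.0, 5.0],
                    pd.Timestamp('2018-03-01'): [None, 6.0]})
# last column with a value per row, shifted to the end of that month
last = tmp.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
# keep the computed end only where the last column is missing
end = last.dt.strftime('%Y%m%d').where(tmp.iloc[:, -1].isna(), '99991231')
print (end)
0    20180228
1    99991231
dtype: object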
It is also possible to create valid datetimes in both new columns by using Timestamp.max with Timestamp.floor instead of the 99991231 string:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')

begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
# maximum supported timestamp, floored to whole days, as the open-ended default
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))

df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 2019-01-01 2262-04-11
1 24.0 2018-01-01 2262-04-11
2 25.0 2018-01-01 2262-04-11
3 NaN 2018-01-01 2018-12-31
4 NaN 2018-01-01 2018-11-30
5 NaN 2018-01-01 2019-02-28
[6 rows x 22 columns]
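The 2262-04-11 values in date_end come from the maximum timestamp pandas can represent, floored to whole days:
print (pd.Timestamp.max)
2262-04-11 23:47:16.854775807
print (pd.Timestamp.max.floor('d'))
2262-04-11 00:00:00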