The idea is to convert the non-datetimelike columns to a MultiIndex with DataFrame.set_index and then convert the remaining columns to datetimes:
import pandas as pd

df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')
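Because labels like 201801 have no day part, to_datetime parses them to the first day of each month, which is what the MonthEnd shift below relies on. A quick illustration with made-up labels:
print (pd.to_datetime(['201801', '201906'], format='%Y%m'))
DatetimeIndex(['2018-01-01', '2019-06-01'], dtype='datetime64[ns]', freq=None)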
Then create the new columns with DataFrame.assign. For date_begin, filter the January columns, test them for non-missing values and take the first matching column label per row with DataFrame.idxmax, then convert to the output format with Series.dt.strftime. For date_end, first reverse the column order with ::-1 indexing and get the last non-missing column the same way, shift it to the last day of that month with offsets.MonthEnd, and finally substitute the default 99991231 where the last column is not missing, using Series.where:
# first January column with a value -> period start
begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1).dt.strftime('%Y%m%d')
# last column with a value, shifted to the end of that month
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
# keep the computed end only where the last month is missing, else the default
end = end1.dt.strftime('%Y%m%d').where(df.iloc[:, -1].isna(), '99991231')

# restore the original column labels and attach the new columns
df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 20190101 99991231
1 24.0 20180101 99991231
2 25.0 20180101 99991231
3 NaN 20180101 20181231
4 NaN 20180101 20181130
5 NaN 20180101 20190228
[6 rows x 22 columns]
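To show how the idxmax / MonthEnd / where steps work in isolation, here is a minimal sketch on a small made-up frame (the column dates and values are illustrative only, not from the question's data):
tmp = pd.DataFrame({pd.Timestamp('2018-01-01'): [1.0, None],
                    pd.Timestamp('2018-02-01'): [2.0, 5.0],
                    pd.Timestamp('2018-03-01'): [None, 6.0]})
# last column with a value per row, shifted to the end of that month
last = tmp.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
# keep the computed end only where the last column is missing
end = last.dt.strftime('%Y%m%d').where(tmp.iloc[:, -1].isna(), '99991231')
print (end)
0    20180228
1    99991231
dtype: object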
It is also possible to create valid datetimes in both new columns by using Timestamp.max with Timestamp.floor instead of the 99991231 string:
df = pd.read_excel('input.xlsx')
df = df.set_index(['id','region'])
df.columns = pd.to_datetime(df.columns, format='%Y%m')

begin = df.loc[:, df.columns.month == 1].notna().idxmax(axis=1)
end1 = df.iloc[:, ::-1].notna().idxmax(axis=1) + pd.offsets.MonthEnd()
# maximum supported timestamp, floored to whole days, as the open-ended default
end = end1.where(df.iloc[:, -1].isna(), pd.Timestamp.max.floor('d'))

df.columns = df.columns.strftime('%Y%m')
df = df.assign(date_begin = begin, date_end = end).reset_index()
print (df)
id region 201801 201802 201803 201804 201805 201806 201807 \
0 100001 628 NaN NaN NaN NaN NaN NaN NaN
1 100002 1149 27.0 24.0 27.0 25.0 24.0 26.0 27.0
2 100003 1290 26.0 26.0 26.0 26.0 23.0 27.0 27.0
3 100004 955 25.0 26.0 26.0 24.0 24.0 26.0 28.0
4 100005 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
5 100006 1397 15.0 25.0 26.0 24.0 21.0 27.0 27.0
201808 ... 201811 201812 201901 201902 201903 201904 201905 \
0 NaN ... 24 20.0 26.0 24.0 26.0 26.0 26.0
1 28.0 ... 24 21.0 26.0 25.0 27.0 24.0 26.0
2 NaN ... 28 NaN 28.0 26.0 27.0 27.0 27.0
3 27.0 ... 24 12.0 NaN NaN NaN NaN NaN
4 26.0 ... 25 NaN NaN NaN NaN NaN NaN
5 26.0 ... 25 23.0 25.0 17.0 NaN NaN NaN
201906 date_begin date_end
0 23.0 2019-01-01 2262-04-11
1 24.0 2018-01-01 2262-04-11
2 25.0 2018-01-01 2262-04-11
3 NaN 2018-01-01 2018-12-31
4 NaN 2018-01-01 2018-11-30
5 NaN 2018-01-01 2019-02-28
[6 rows x 22 columns]
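The 2262-04-11 values in date_end come from the maximum timestamp pandas can represent, floored to whole days:
print (pd.Timestamp.max)
2262-04-11 23:47:16.854775807
print (pd.Timestamp.max.floor('d'))
2262-04-11 00:00:00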