0

I'm trying to create new lines between the start_date and end_date, based on a certain ID. My dataframe looks like this:

    id  start_date  end_date
0   477 2020-02     2020-04
1   518 2020-08     2020-08
2   553 2020-03     2020-04
3   925 2020-07     2020-09
4   948 2020-05     2020-06
5   971 2020-02     2020-07
6   987 2020-01     2020-02
7   117 2020-06     2020-07
8   117 2020-05     2020-07
9   133 2020-05     2020-06

For example. If the ID 477 (first one) begins in 2020-02 and ends in 2020-04, i have to create a new dataframe with plus two new lines, like this:

    id  dates
0   477 2020-02
1   477 2020-03
2   477 2020-04
3   553 2020-03
4   553 2020-04

In the end, i'm trying to create a pivot with the following code and structure:

pd.pivot_table(columns=df_churn['end_date'], index=df_churn['id'], aggfunc='count', data=df_churn, fill_value=0)

This is what the table would look like:

        2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08 2020-09 2020-10
id                                  
477        1       1       1       0       0       0       0       0       0
518        0       0       0       0       0       0       1       0       0
553        0       1       1       0       0       0       0       0       0
bellotto
  • 445
  • 3
  • 13

1 Answers1

0

You'll need import io for this example to work. You may want to clean up the columns to get back to your original name and formats.

data='''    id  start_date  end_date
0   477  2020-02     2020-04
1   518  2020-08     2020-08
2   553  2020-03     2020-04
3   925  2020-07     2020-09
4   948  2020-05     2020-06
5   971  2020-02     2020-07
6   987  2020-01     2020-02
7   117  2020-06     2020-07
8   117  2020-05     2020-07
9   133  2020-05     2020-06'''
df = pd.read_csv(io.StringIO(data), sep=' \s+', engine='python')
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

df = df.set_index('id', drop=True)
df = df.stack().reset_index().drop('level_1', axis=1).rename(columns={0:'dates'})
df['dummy'] = 1
df = df.set_index('dates', drop=True)
df1 = df.groupby(['id', 'dates']).head(1)

date_index2 = pd.date_range(start=df1.index.min(), end=df1.index.max(), freq='MS')


def reindex_groups(x):
    # id = x['id'][0]
    x = x.reindex(date_index2).fillna(0)
    # x['id1'] = id
    return x

dfn = df1.groupby('id').apply(reindex_groups)
dfn.drop('id', axis=1, inplace=True)
dfn.reset_index(inplace=True)
dfn = dfn.astype({"dummy": int})
dfpivot = dfn.pivot(index='id', columns='level_1', values='dummy')
dfpivot 



level_1  2020-01-01  2020-02-01  2020-03-01  2020-04-01  2020-05-01  2020-06-01  2020-07-01  2020-08-01  2020-09-01
id
117               0           0           0           0           1           1           1           0           0
133               0           0           0           0           1           1           0           0           0
477               0           1           0           1           0           0           0           0           0
518               0           0           0           0           0           0           0           1           0
553               0           0           1           1           0           0           0           0           0
925               0           0           0           0           0           0           1           0           1
948               0           0           0           0           1           1           0           0           0
971               0           1           0           0           0           0           1           0           0
987               1           1           0           0           0           0           0           0           0
Jonathan Leon
  • 5,440
  • 2
  • 6
  • 14