How to populate dates in a dataframe using pandas in Python

Question

I have a dataframe with two columns, Case and Date, where Date is the starting date of each case. I want to expand it into a time series: for each case, add the next three (month_num) monthly dates and drop the original starting date.

original dataframe:

   Case       Date
0     1 2010-01-01
1     2 2011-04-01
2     3 2012-08-01

after populating dates:

   Case        Date
0     1  2010-02-01
1     1  2010-03-01
2     1  2010-04-01
3     2  2011-05-01
4     2  2011-06-01
5     2  2011-07-01
6     3  2012-09-01
7     3  2012-10-01
8     3  2012-11-01

I tried declaring an empty dataframe with the same column names and data types, then looping over Case and month_num with a for loop and adding rows to the new dataframe.

import pandas as pd

data = [[1, '2010-01-01'],  [2, '2011-04-01'], [3, '2012-08-01']]
 
df = pd.DataFrame(data, columns = ['Case', 'Date'])

df.Date = pd.to_datetime(df.Date)

df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])

month_num = 3

for c in df.Case:
    for m in range(1, month_num+1):
        # take a copy so the assignment below does not trigger SettingWithCopyWarning
        temp = df.loc[df['Case']==c].copy()
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])

df_new.reset_index(inplace=True, drop=True)

My code works, but when the original dataframe and month_num become large, it takes a very long time to run. Is there a better way to do this? Thanks a lot!

score 2 · Answer 1 · answered Apr 27 '22 at 20:12


Your performance issue is probably caused by calling pd.concat inside the inner for loop: each call copies all the rows accumulated so far into a new dataframe. This answer explains it in more detail.

As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate the list once.
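
A minimal sketch of that idea, using the sample data from the question. It goes one step further than the suggestion above and loops only over the month offsets, not over the cases, shifting the whole Date column in each pass; that part is my own assumption about what fits the use case. The names frames and df_new are just illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Case": [1, 2, 3],
    "Date": pd.to_datetime(["2010-01-01", "2011-04-01", "2012-08-01"]),
})
month_num = 3

# Collect one shifted copy of the whole frame per month offset.
# The loop runs month_num times, regardless of how many cases there are.
frames = []
for m in range(1, month_num + 1):
    frames.append(df.assign(Date=df["Date"] + pd.DateOffset(months=m)))

# Concatenate the list once, then restore the Case/Date ordering.
df_new = (
    pd.concat(frames)
    .sort_values(["Case", "Date"])
    .reset_index(drop=True)
)
print(df_new)
```

With the sample data this should reproduce the "after populating dates" table from the question.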

answered Apr 27 '22 at 20:12

mattiatantardini


Thank you, that helps. I think the update of temp['Date'] inside the for loop also costs a lot of time. If I remove this line, it's much quicker. – Harry Apr 28 '22 at 15:34

score 1 · Accepted Answer · answered Apr 27 '22 at 21:26


Given your input data this is what worked on my notebook:

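df2=pd.DataFrame()

# freq='M' yields month-end dates (newer pandas versions spell this alias 'ME')
df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()

# match each month-end date to the latest start date on or before it to recover 'Case'
df3=pd.merge_asof(df2,df,on='Date')
# month end + 1 day = first day of the following month
df3['Date']=df3['Date']+ pd.DateOffset(days=1)
df3[['Case','Date']]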

We create df2 and populate its 'Date' column with the needed dates derived from the original df.
Then df3 is the result of a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting column by 1 day.
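
For comparison, here is a sketch of a variant that avoids both apply and merge_asof by repeating the rows and adding month steps through monthly periods. It is not the method described above, only an alternative under two assumptions: every starting date falls on the first day of a month (as in the sample data), and month_num is defined as in the question. The names out and steps are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Case": [1, 2, 3],
    "Date": pd.to_datetime(["2010-01-01", "2011-04-01", "2012-08-01"]),
})
month_num = 3

# Repeat each row month_num times by position: rows 0,0,0,1,1,1,2,2,2
out = df.iloc[np.repeat(np.arange(len(df)), month_num)].reset_index(drop=True)

# Month steps 1..month_num, repeated once per case: 1,2,3,1,2,3,1,2,3
steps = np.tile(np.arange(1, month_num + 1), len(df))

# Convert to monthly periods, add the steps, and convert back.
# to_timestamp() returns the first day of each month, hence the
# assumption that the starting dates are month starts.
out["Date"] = (out["Date"].dt.to_period("M") + steps).dt.to_timestamp()
print(out)
```

Because 'Case' is carried along by the row repetition, no merge is needed to recover it.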

answered Apr 27 '22 at 21:26

Daniel Weigel


I've heard that 'apply' is not the best-performing pandas method, though... – Daniel Weigel Apr 27 '22 at 21:30

Thanks. I used your approach to populate the 'Date' and 'Case' columns and concatenated them column-wise, then joined the result with another dataframe on 'Date' and 'Case'. This meets my expectations on run time. – Harry Apr 28 '22 at 15:37
