
Existing dataframes:

df_1

Id         dates              time(sec)_1           time(sec)_2
1        02/02/2022             15                     20
1        04/02/2022             20                     30
1        03/02/2022             30                     40
1        06/02/2022             50                     40
2        10/02/2022             10                     10
2        11/02/2022             15                     20

df_2

Id          min_date          action_date         
1          02/02/2022          04/02/2022 
2          06/02/2022          10/02/2022     

Expected dataframe: df_2

Id          min_date          action_date        count_of_dates     avg_time_1        avg_time_2
1          02/02/2022          04/02/2022              3               21.67              30
2          06/02/2022          10/02/2022              1               10                 10

count_of_dates, avg_time_1, and avg_time_2 are to be created from df_1. count_of_dates is the number of rows of df_1 (per Id) whose date falls between min_date and action_date.

avg_time_1 and avg_time_2 are the averages of time(sec)_1 and time(sec)_2 over those same rows.
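For example, for Id 1 the window 02/02/2022–04/02/2022 covers three rows of df_1, giving avg_time_1 = (15+20+30)/3 ≈ 21.67 (a quick sanity check, assuming the dates are dd/mm/yyyy):

```python
import pandas as pd

# df_1 from above (dates written dd/mm/yyyy)
df_1 = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2],
    'dates': pd.to_datetime(['02/02/2022', '04/02/2022', '03/02/2022',
                             '06/02/2022', '10/02/2022', '11/02/2022'],
                            dayfirst=True),
    'time(sec)_1': [15, 20, 30, 50, 10, 15],
    'time(sec)_2': [20, 30, 40, 40, 10, 20],
})

# Id 1, window 02/02/2022 .. 04/02/2022 -> the rows for Feb 2, 3 and 4
m = df_1['Id'].eq(1) & df_1['dates'].between(
    pd.Timestamp(2022, 2, 2), pd.Timestamp(2022, 2, 4))
print(m.sum())                                      # 3
print(round(df_1.loc[m, 'time(sec)_1'].mean(), 2))  # 21.67
print(df_1.loc[m, 'time(sec)_2'].mean())            # 30.0
```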

Stuck with applying the condition on the dates :-( any leads?

Romi

1 Answer


If the data is small, you can filter per row with a custom function:

# parse the dd/mm/yyyy strings as datetimes (dayfirst=True avoids day/month swaps)
df_1['dates'] = pd.to_datetime(df_1['dates'], dayfirst=True)

df_2[['min_date','action_date']] = df_2[['min_date','action_date']].apply(pd.to_datetime, dayfirst=True)

def f(x):
    # rows of df_1 with the same Id whose date lies in [min_date, action_date]
    m = df_1['Id'].eq(x['Id']) & df_1['dates'].between(x['min_date'], x['action_date'])
    # means of the two time columns over the matching rows
    s = df_1.loc[m, ['time(sec)_1','time(sec)_2']].mean()

    return pd.Series([m.sum()] + s.to_list(), index=['count_of_dates'] + s.index.tolist())

df = df_2.join(df_2.apply(f, axis=1))
print (df)
   Id   min_date action_date  count_of_dates  time(sec)_1  time(sec)_2
0   1 2022-02-02  2022-02-04             3.0    21.666667         30.0
1   2 2022-02-06  2022-02-10             1.0    10.000000         10.0

If Id in df_2 is unique, you can improve performance by merging df_1 in and aggregating size and mean:

df = df_2.merge(df_1, on='Id')

# named aggregations: row count plus the means of both time columns
d = {'count_of_dates':('Id','size'),
     'time(sec)_1':('time(sec)_1','mean'),
     'time(sec)_2':('time(sec)_2','mean')}
df = df_2.join(df[df['dates'].between(df['min_date'], df['action_date'])]
                             .groupby('Id').agg(**d), on='Id')
print (df)
   Id   min_date action_date  count_of_dates  time(sec)_1  time(sec)_2
0   1 2022-02-02  2022-02-04               3    21.666667           30
1   2 2022-02-06  2022-02-10               1    10.000000           10
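One edge case worth noting (my addition, not part of the original answer): if an Id in df_2 has no df_1 rows inside its window, the join leaves NaN in all three new columns; count_of_dates can be restored to 0 with fillna:

```python
import pandas as pd

# minimal data where Id 2 has no matching rows in df_1 at all
df_1 = pd.DataFrame({'Id': [1],
                     'dates': pd.to_datetime(['2022-02-03']),
                     'time(sec)_1': [15], 'time(sec)_2': [20]})
df_2 = pd.DataFrame({'Id': [1, 2],
                     'min_date': pd.to_datetime(['2022-02-02', '2022-02-06']),
                     'action_date': pd.to_datetime(['2022-02-04', '2022-02-10'])})

tmp = df_2.merge(df_1, on='Id')
d = {'count_of_dates': ('Id', 'size'),
     'time(sec)_1': ('time(sec)_1', 'mean'),
     'time(sec)_2': ('time(sec)_2', 'mean')}
out = df_2.join(tmp[tmp['dates'].between(tmp['min_date'], tmp['action_date'])]
                   .groupby('Id').agg(**d), on='Id')

# Id 2 got NaN everywhere; a count of 0 is usually what you want here
out['count_of_dates'] = out['count_of_dates'].fillna(0).astype(int)
print(out)
```

The averages stay NaN for such Ids, which is arguably correct (there is nothing to average).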
jezrael
  • thanks jez! how to get only those rows from df_1 that are between min_date and action_date and store them in a new dataframe – Romi Feb 13 '23 at 13:35
  • @Romi - Exactly this is in my answer. – jezrael Feb 13 '23 at 14:15
  • in the count_of_dates, does it include both of the dates as well? I mean min_date & action_date, because in a few of my cases it's not counting the action_date while summing – Romi Feb 13 '23 at 14:30
  • @Romi - Can you check [`Series.between`](http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.between.html)? It should include the first and last datetime – jezrael Feb 14 '23 at 07:26
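Regarding the first comment above: the matching df_1 rows can also be kept as their own dataframe with the same merge-plus-between mask (a sketch on the sample data; `Series.between` is inclusive on both ends by default):

```python
import pandas as pd

df_1 = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2],
    'dates': pd.to_datetime(['2022-02-02', '2022-02-04', '2022-02-03',
                             '2022-02-06', '2022-02-10', '2022-02-11']),
    'time(sec)_1': [15, 20, 30, 50, 10, 15],
    'time(sec)_2': [20, 30, 40, 40, 10, 20],
})
df_2 = pd.DataFrame({
    'Id': [1, 2],
    'min_date': pd.to_datetime(['2022-02-02', '2022-02-06']),
    'action_date': pd.to_datetime(['2022-02-04', '2022-02-10']),
})

# attach each row's window, keep rows inside it, then drop the window columns
tmp = df_1.merge(df_2, on='Id')
inside = tmp['dates'].between(tmp['min_date'], tmp['action_date'])  # inclusive ends
filtered = tmp.loc[inside, df_1.columns.tolist()].reset_index(drop=True)
print(filtered)
```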