1

Objective: I need to show the trend in ageing of issues. e.g. for each date in 2021 show the average age of the issues that were open as at that date.

Starting data (Historic issue list):. "df"

ref Created resolved
a1 20/4/2021 15/5/2021
a2 21/4/2021 15/6/2021
a3 23/4/2021 15/7/2021

Endpoint: "df2"

Date Avg_age
1/1/2021 x
2/1/2021 y
3/1/2021 z

where x,y,z are the calculated averages of age for all issues open on the Date.

Tried so far: I got this to work in what feels like a very poor way.

  1. create a date range (pd.date_range(start,finish,freq="D")
  2. I loop through the dates in this range and for each date I filter the "df" dataframe (boolean filtering) to show only issues live on the date in question. Then calc age (date - created) and average for those. Each result appended to a list.
  3. once done I just convert the list into a series for my final result, which I can then graph or whatever.
    hist_dates = pd.date_range(start="2021-01-01",end="2021-12-31"),freq="D")
    result_list = []
    for each_date in hist_dates:
        f1=df.Created < each_date      #filter1
        f2=df.Resolved >= each_date    #filter2
        df['Refdate'] = each_date      #make column to allow refdate-created  
        df['Age']= (df.Refdate - df.Created)
        results_list.append(df[f1 & f2]).Age.mean())

Problems: This works, but it feels sloppy and it doesn't seem fast. The current data-set is small, but I suspect this wouldn't scale well. I'm trying not to solve everything with loops as I understand it is a common mistake for beginners like me.

Burgertron
  • 76
  • 6
  • You should include the code for your solution which you think is deficient - it along with your written explanation will help the reader understand what you are trying to do. A sample dataset to work with will help potential answerers test your solution and craft one of their own. [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – wwii May 10 '22 at 13:52

1 Answers1

1

I'll give you two solutions: the first one is step-by-step for you to understand the idea and process, the second one replicates the functionality in a much more condensed way, skipping some intermediate steps


First, create a new column that holds your issue age, i.e. df['age'] = df.resolved - df.Created (I'm assuming your columns are of datetime type, if not, use pd.to_datetime to convert them)

You can then use groupby to group your data by creation date. This will internally slice your dataframe into several pieces, one for each distinct value of Created, grouping all values with the same creation date together. This way, you can then use aggregation on a creation date level to get the average issue age like so

# [['Created', 'age']] selects only the columns you are interested in
df[['Created', 'age']].groupby('Created').mean()

With an additional fourth data point [a4, 2021/4/20, 2021/4/30] (to enable some proper aggregation), this would end up giving you the following Series with the average issue age by creation date:

                        age
Created                    
2021-04-20 17 days 12:00:00
2021-04-21 55 days 00:00:00
2021-04-23 83 days 00:00:00

A more condensed way of doing this is by defining a custom function and apply it to each creation date grouping

def issue_age(s: pd.Series):
    return (s['resolved'] - s['Created']).mean()

df.groupby('Created').apply(issue_age)

This call will give you the same Series as before.

Lukas Thaler
  • 2,672
  • 5
  • 15
  • 31
  • Thank you for the response and clear explanation - would this allow me to calc the average age "as at" a range of historical dates? My end-point is to be able to say: as at 1/6/2021 the avg age of open issues was 22.1 days, at 3/6/2021 avg age was 24.2 days and so on - for every day in 2021. – Burgertron May 11 '22 at 23:00
  • If I understand you correctly, your goal can be achieved by filtering your data frame for entries where the `Created` column is older than and the `resolved` column is after your specified date, and then calculate the overall average age on that subset. I don't see an immediate way to vectorize that query, however – Lukas Thaler May 12 '22 at 06:13
  • Thanks - yeah you summarised it better than me: calculate average age on subset (and there would be 365 subsets to calc for each year. At a minimum using apply seems more readable than my looping version. I will have a play with that - thanks again. – Burgertron May 12 '22 at 20:10