0

I currently have a dataset structured as follows:

id_number    start_date    end_date   data1    data2    data3   ...

Basically, I have a whole bunch of IDs, each with a date range and multiple columns of summary data. My problem is that I need yearly totals of the summary data. This means I need to get to a place where I can group by year on a single occurrence of each document. However, a document is not guaranteed to exist for a given year, and the date ranges can span multiple years. Any help would be greatly appreciated; I am quite stuck.

Sample dataframe:

import pandas as pd

df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
# Parse the month/day/year strings into datetimes
df['start'] = pd.to_datetime(df['start'], format="%m/%d/%Y")
df['end'] = pd.to_datetime(df['end'], format="%m/%d/%Y")
Josh Friedlander
Stephen Strosko
  • In the event that a record spans for more than a year, what does this mean for the totals? / How are you wanting to deal with them? (or is that your question?) – JimmyA Mar 11 '19 at 15:54
  • Looking at your data I'm wondering how you would differentiate the data between years. For instance, `id_number` 43482 has a `start_date` of 2/3/2017 and an `end_date` of 3/10/2019 and `data1` of 119. How do you know what the data is for 2018? I would need more info. – MacItaly Mar 11 '19 at 15:55
  • Sure, so if it spans multiple years, that data should stay the same for the years that it spans. So if an id spans from 2005-2007 and then changes in 2008, the data should be the same for the years 2005, 2006, and 2007, and then change in 2008. – Stephen Strosko Mar 11 '19 at 15:57
  • MacItaly, the assumption here is that the data stays the same over the time period of each entry if that makes sense. – Stephen Strosko Mar 11 '19 at 15:58
  • Have considered grabbing each unique id in a list and then working with each id's data points - that is currently where I am at. – Stephen Strosko Mar 11 '19 at 16:12

2 Answers

2

Assuming we have a DataFrame df:

   id_num      start        end  value
0       1 2002-03-10 2005-04-12      1
1       1 2005-04-13 2005-05-20      2
2       1 2007-05-21 2009-08-10      3
3       2 2012-02-20 2015-02-20      4
4       3 2003-10-19 2012-12-12      5

we can create a row for each year for our start to end ranges with:

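import numpy as np
import pandas as pd

# One array of calendar years per row, covering start.year through end.year
ys = [np.arange(s, e + 1) for s, e in zip(df['start'].dt.year, df['end'].dt.year)]

# Spread the years into rows, then join the original columns back on the row index
df = (pd.DataFrame(ys, df.index)
     .stack()
     .astype(int)
     .reset_index(level=1, drop=True)  # keyword arguments: newer pandas rejects positional ones here
     .to_frame('year')
     .join(df, how='left')
     .reset_index())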

print(df)

Here we're first creating the ys variable with the list of years for each start-end range from our DataFrame, and the df = ... is splitting these year lists into separate rows and joining back to the original DataFrame (very similar to what's done in this post: How to convert column with list of values into rows in Pandas DataFrame).

Output:

    index  year  id_num      start        end  value
0       0  2002       1 2002-03-10 2005-04-12      1
1       0  2003       1 2002-03-10 2005-04-12      1
2       0  2004       1 2002-03-10 2005-04-12      1
3       0  2005       1 2002-03-10 2005-04-12      1
4       1  2005       1 2005-04-13 2005-05-20      2
5       2  2007       1 2007-05-21 2009-08-10      3
6       2  2008       1 2007-05-21 2009-08-10      3
7       2  2009       1 2007-05-21 2009-08-10      3
8       3  2012       2 2012-02-20 2015-02-20      4
9       3  2013       2 2012-02-20 2015-02-20      4
10      3  2014       2 2012-02-20 2015-02-20      4
11      3  2015       2 2012-02-20 2015-02-20      4
12      4  2003       3 2003-10-19 2012-12-12      5
13      4  2004       3 2003-10-19 2012-12-12      5
14      4  2005       3 2003-10-19 2012-12-12      5
15      4  2006       3 2003-10-19 2012-12-12      5
16      4  2007       3 2003-10-19 2012-12-12      5
17      4  2008       3 2003-10-19 2012-12-12      5
18      4  2009       3 2003-10-19 2012-12-12      5
19      4  2010       3 2003-10-19 2012-12-12      5
20      4  2011       3 2003-10-19 2012-12-12      5
21      4  2012       3 2003-10-19 2012-12-12      5

Note: I changed the original ranges to test the case where some years are missing for an id_num. For id_num=1 the ranges are 2002-2005, 2005-2005 and 2007-2009, so 2006 should not appear for id_num=1 in the output (and it doesn't, so the test passes).
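To get from the expanded rows to the yearly totals the question asks for, something like the following might work. It is only a sketch: it swaps the stack/join step for `DataFrame.explode`, and it assumes that when one id has several records touching the same year (id_num=1 in 2005 here), the record with the latest start date is the one to keep. That rule, and the names `expanded`, `latest` and `yearly_totals`, are assumptions for illustration and not part of the original answer.

```python
import pandas as pd

# Same test frame as above (note the missing 2006 for id_num=1)
df = pd.DataFrame({
    'id_num': [1, 1, 1, 2, 3],
    'start': pd.to_datetime(['2002-03-10', '2005-04-13', '2007-05-21',
                             '2012-02-20', '2003-10-19']),
    'end': pd.to_datetime(['2005-04-12', '2005-05-20', '2009-08-10',
                           '2015-02-20', '2012-12-12']),
    'value': [1, 2, 3, 4, 5],
})

# One list of calendar years per record, then one row per (record, year)
df['year'] = [list(range(s, e + 1))
              for s, e in zip(df['start'].dt.year, df['end'].dt.year)]
expanded = df.explode('year').astype({'year': 'int64'}).reset_index(drop=True)

# ASSUMPTION: keep a single record per id and year, preferring the latest start
latest = (expanded.sort_values(['id_num', 'year', 'start'])
                  .drop_duplicates(['id_num', 'year'], keep='last'))

# Yearly totals across ids
yearly_totals = latest.groupby('year')['value'].sum()
print(yearly_totals)
```

If a different rule fits the data better (for example, the record covering most of the year), only the `drop_duplicates` step would need to change.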

perl
0

I've taken your example and added some random values so we have something to work with:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, '3/10/2002', '4/12/2005'], [1, '4/13/2005', '5/20/2005'], [1, '5/21/2005', '8/10/2009'], [2, '2/20/2012', '2/20/2015'], [3, '10/19/2003', '12/12/2012']])
df.columns = ['id_num', 'start', 'end']
df['start'] = pd.to_datetime(df['start'], format="%m/%d/%Y")
df['end'] = pd.to_datetime(df['end'], format="%m/%d/%Y")

np.random.seed(0)  # seeding the random values for reproducibility
df['value'] = np.random.random(len(df))

So far we have:

   id_num      start        end     value
0       1 2002-03-10 2005-04-12  0.548814
1       1 2005-04-13 2005-05-20  0.715189
2       1 2005-05-21 2009-08-10  0.602763
3       2 2012-02-20 2015-02-20  0.544883
4       3 2003-10-19 2012-12-12  0.423655

We want a value for the year of every date, whether it is a start date or an end date, so we will treat all dates the same. We just want date + user + value:

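tmp = df[['end', 'value']].copy()
tmp = tmp.rename(columns={'end': 'start'})  # treat end dates like start dates
new = pd.concat([df[['start', 'value']], tmp], sort=True)
# doubling the id numbers (Series.append was removed in pandas 2.0, so use pd.concat)
new['id_num'] = pd.concat([df['id_num'], df['id_num']]).to_numpy()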

Giving us:

       start     value  id_num
0 2002-03-10  0.548814       1
1 2005-04-13  0.715189       1
2 2005-05-21  0.602763       1
3 2012-02-20  0.544883       2
4 2003-10-19  0.423655       3
0 2005-04-12  0.548814       1
1 2005-05-20  0.715189       1
2 2009-08-10  0.602763       1
3 2015-02-20  0.544883       2
4 2012-12-12  0.423655       3

Now we can group by ID number and year:

# Select 'value' before summing: newer pandas raises on summing the datetime column
new = new.groupby(['id_num', new['start'].dt.year])['value'].sum().reset_index(0).sort_index()

    id_num  value
start       
2002    1   0.548814
2003    3   0.423655
2005    1   2.581956
2009    1   0.602763
2012    2   0.544883
2012    3   0.423655
2015    2   0.544883

And finally, for each user we expand the range to have every year in between, filling forward missing data:

# For each id, reindex to every year from first to last and forward-fill the gaps
new = new.groupby('id_num')['value'].apply(lambda s: s.reindex(range(s.index.min(), s.index.max() + 1)).ffill()).to_frame()

             value
id_num      
1   2002    0.548814
    2003    0.548814
    2004    0.548814
    2005    2.581956
    2006    2.581956
    2007    2.581956
    2008    2.581956
    2009    0.602763
2   2012    0.544883
    2013    0.544883
    2014    0.544883
    2015    0.544883
3   2003    0.423655
    2004    0.423655
    2005    0.423655
    2006    0.423655
    2007    0.423655
    2008    0.423655
    2009    0.423655
    2010    0.423655
    2011    0.423655
    2012    0.423655
Josh Friedlander
  • I believe something is wrong here. I checked this on the DataFrame from my answer, and even though there are no records for `id_num=1` that include `year=2006`, I'm getting `1 2006 5.0` with your code – perl Mar 11 '19 at 17:27
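
The behaviour described in the comment above can be reproduced in isolation. The sketch below assumes the per-year sums that the grouping step would produce for id_num=1 on the frame with the 2006 gap; they are worked out by hand from its start and end dates (2002: 1, 2005: 1 + 2 + 2 = 5, 2007: 3, 2009: 3) rather than taken from a run. The names `sums` and `filled` are illustrative only.

```python
import pandas as pd

# ASSUMPTION: hand-computed per-year sums for id_num=1 on the frame
# whose ranges are 2002-2005, 2005-2005 and 2007-2009 (no 2006 record)
sums = pd.Series([1.0, 5.0, 3.0, 3.0],
                 index=[2002, 2005, 2007, 2009], name='value')

# Same expansion as the final step: every year from min to max, forward-filled
filled = sums.reindex(range(sums.index.min(), sums.index.max() + 1)).ffill()
print(filled)
```

Because the reindex spans the first to the last year of each id, a year that no record covers (2006 here) still gets a row carrying the previous year's value. Expanding each record over its own start-to-end years instead would avoid filling such gaps.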