6

I have a dataframe with records spanning multiple years:

 WarName       |  StartDate  |  EndDate
 --------------|-------------|------------
 'fakewar1'    |  01-01-1990 |  02-02-1995
 'examplewar'  |  05-01-1990 |  03-07-1998
 (...)
 'examplewar2' |  05-07-1999 |  06-09-2002

I am trying to convert this dataframe to a summary overview of the total wars per year, e.g.:

  Year  |  Number_of_wars
----------------------------
  1989         0
  1990         2
  1991         2
  1992         3
  1994         2

Usually I would use something like df.groupby('year').count() to get the total wars per year, but since I am working with date ranges instead of single dates, that approach won't work.

I am currently writing a function that generates a list of years and then, for each year, iterates over the rows of the dataframe, calling a helper that returns True if the year falls within that row's date range.

def year_in_range(year, row):
    # True if `year` falls within this row's date range
    return row['StartDate'].year <= year <= row['EndDate'].year

years = range(1816, 2006)
year_dict = {}
for year in years:
    for index, row in df.iterrows():
        if year_in_range(year, row):
            year_dict[year] = year_dict.get(year, 0) + 1

This works, but it also seems extremely convoluted. So I was wondering: what am I missing? What would be the canonical 'pandas way' to solve this?

Jasper
  • Is it OK to have missing rows for years with zero counts, e.g. 1993? But then you did include 1989. – smci May 21 '18 at 10:08
  • You don't need year_dict, you can just instantiate and directly index into an array `wars[year-1816]`, of length (2006-1816+1) = 191. – smci May 21 '18 at 10:14

3 Answers

6

Use a comprehension with pd.value_counts

# one year-end timestamp per calendar year each war spans;
# counting those year values gives wars per year
pd.value_counts([
    d.year for s, e in zip(df.StartDate, df.EndDate)
    for d in pd.date_range(s, e, freq='Y')
]).sort_index()

1990    2
1991    2
1992    2
1993    2
1994    2
1995    1
1996    1
1997    1
1999    1
2000    1
2001    1
dtype: int64
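If you also want explicit zeros for years without wars (like 1989 in the question's example output), one way is to reindex the result over the full span; a minimal sketch, with the bounds borrowed from the question's range(1816, 2006):

counts = pd.value_counts([
    d.year for s, e in zip(df.StartDate, df.EndDate)
    for d in pd.date_range(s, e, freq='Y')
]).sort_index()

# years with no wars get a count of 0
counts.reindex(range(1816, 2006), fill_value=0)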

Alternate

from functools import reduce

def r(t):
    # year-end dates spanned by one war
    return pd.date_range(t.StartDate, t.EndDate, freq='Y')

# append all per-war DatetimeIndex objects into one, then count the years
pd.value_counts(reduce(pd.Index.append, map(r, df.itertuples())).year).sort_index()

Setup

import pandas as pd

df = pd.DataFrame(dict(
    WarName=['fakewar1', 'examplewar', 'feuxwar2'],
    StartDate=pd.to_datetime(['01-01-1990', '05-01-1990', '05-07-1999']),
    EndDate=pd.to_datetime(['02-02-1995', '03-07-1998', '06-09-2002'])
), columns=['WarName', 'StartDate', 'EndDate'])

df

      WarName  StartDate    EndDate
0    fakewar1 1990-01-01 1995-02-02
1  examplewar 1990-05-01 1998-03-07
2    feuxwar2 1999-05-07 2002-06-09
piRSquared
3

Using np.unique

import numpy as np

# flatten the years spanned by each war (end year exclusive), then count them
years, counts = np.unique(
    sum([list(range(s.year, e.year)) for s, e in zip(df.StartDate, df.EndDate)], []),
    return_counts=True)
pd.Series(dict(zip(years, counts)))
1990    2
1991    2
1992    2
1993    2
1994    2
1995    1
1996    1
1997    1
1999    1
2000    1
2001    1
dtype: int64
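Note that range(s.year, e.year) excludes the end year, which is why 1998 and 2002 don't appear above; if a war's final (partial) year should count as well, use range(s.year, e.year + 1).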
BENY
0

The other answers with pandas are far preferable, but the native Python answer you showed didn't have to be so convoluted; just instantiate and directly index into an array:

wars = [0] * 191   # max(df['EndDate']).year - min(df['StartDate']).year + 1
yr_offset = 1816   # min(df['StartDate']).year

for _, row in df.iterrows():
    for yr in range(row['StartDate'].year - yr_offset, row['EndDate'].year - yr_offset):  # or + 1 to include the end year
        wars[yr] += 1
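To turn that array back into the Year / Number_of_wars overview, a quick sketch using the same offset:

for year_idx, n in enumerate(wars):
    print(yr_offset + year_idx, n)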
smci
    To the mystery downvoter, please explain your objection. I noted clearly that pandas is preferable. I was curious what a less convoluted native Python solution was. So I spent a little effort. Why would I bother? – smci May 21 '18 at 11:03