
I am trying to do some time trend analysis of cyclone data but am having difficulty getting the pandas DataFrame into a format suitable for that...

Right now the dataframe looks like this:

                  date  season   stormid
0  1970-01-04 12:00:00    1970  SH071970
1  1970-01-04 18:00:00    1970  SH071970
2  1970-01-05 00:00:00    1970  SH071970
3  1970-01-05 06:00:00    1970  SH071970
4  1970-01-05 12:00:00    1970  SH071970
5  1970-01-05 18:00:00    1970  SH071970
6  1970-01-06 00:00:00    1970  SH071970
7  1970-01-06 06:00:00    1970  SH071970
8  1970-01-06 12:00:00    1970  SH071970
9  1970-01-06 18:00:00    1970  SH071970
10 1970-01-07 00:00:00    1970  SH071970

And I would like all the entries for each storm combined into a total day count, for example like this:

  start date    end date  days  season   stormid
0 1970-01-04  1970-01-11    15    1970  SH071970

And so on for each of the different stormids.
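For anyone who wants to reproduce this, the sample frame above can be rebuilt like so (a sketch assuming 6-hourly records for a single storm, as shown in the table):

```python
import pandas as pd

# Rebuild the sample frame shown above: 6-hourly fixes for one storm
df = pd.DataFrame({
    'date': pd.date_range('1970-01-04 12:00', periods=11, freq='6h'),
    'season': 1970,
    'stormid': 'SH071970',
})
print(df)
```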

MateaMar
  • Can you explain the `15`? Is this a dummy number? It should be `2` if you count the number of days elapsed, or `11` if you count the number of rows in the group – mozway Dec 09 '22 at 10:46
  • Yes @mozway, this was a dummy number just to show the visuals. Thanks for pointing it out; in my next post I will note which numbers are and aren't dummies. – MateaMar Dec 11 '22 at 08:54
  • So have you checked my answer below? Does it work for you? – mozway Dec 11 '22 at 16:50

1 Answer


If you want to get the number of days between the min and max per group, use numpy.ptp and Timedelta.days:

import numpy as np
import pandas as pd

# make sure the date column is a real datetime
df['date'] = pd.to_datetime(df['date'])

out = (df.groupby(['season', 'stormid'], as_index=False)
         .agg(days=('date', lambda s: np.ptp(s).days),
              start_date=('date', lambda s: s.min().normalize()),
              end_date=('date', lambda s: s.max().normalize())
             )
       )

NB. `lambda s: np.ptp(s).days` is equivalent to `lambda s: (s.max()-s.min()).days`. If you want to round up, use `lambda s: (s.max()-s.min()).ceil('D').days` (getting 3 days here).
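A quick check of the truncation vs. rounding behavior, using the span from the sample data (1970-01-04 12:00 to 1970-01-07 00:00):

```python
import pandas as pd

# Timedelta of 2 days 12 hours between the first and last fix
delta = pd.Timestamp('1970-01-07 00:00') - pd.Timestamp('1970-01-04 12:00')

print(delta.days)            # .days truncates the partial day -> 2
print(delta.ceil('D').days)  # ceiling rounds the partial day up -> 3
```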

Output:

   season   stormid  days start_date   end_date
0    1970  SH071970     2 1970-01-04 1970-01-07

alternative

df['date'] = pd.to_datetime(df['date'])

out = (df.groupby(['season', 'stormid'], as_index=False)
         .agg(start_date=('date', lambda s: s.min().normalize()),
              end_date=('date', lambda s: s.max().normalize())
             )
         .assign(days=lambda d: d['end_date'].sub(d['start_date']).dt.days)
       )

Output:

   season   stormid start_date   end_date  days
0    1970  SH071970 1970-01-04 1970-01-07     3

counts

If you simply want the number of rows per group (which is not really "days"):

out = (df.groupby(['season', 'stormid'], as_index=False)
         .agg(days=('date', 'size'))
       )

Output:

   season   stormid  days
0    1970  SH071970    11
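For what it's worth, the same row count can also be obtained with `value_counts` (a sketch; it rebuilds the sample frame so it runs standalone):

```python
import pandas as pd

# Sample frame as in the question: 6-hourly fixes for one storm
df = pd.DataFrame({
    'date': pd.date_range('1970-01-04 12:00', periods=11, freq='6h'),
    'season': 1970,
    'stormid': 'SH071970',
})

# Row count per (season, stormid) group via value_counts
out = (df.value_counts(['season', 'stormid'])
         .rename('days')
         .reset_index())
print(out)
```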
mozway