0

I have a sub-data frame consisting of a set of dates within a dataframe. I would like to calculate the moving average within the same sub-data frame and plot it on the same graph I already have (which displays # of cases per day in the sub frame). The moving average needs to go from March 7 to July 10 and the windows need to =7 (one week).

Example Data:

sex         country      date_report
M           Canada       03-01-2020
F           Canada       03-01-2020
M           Canada       03-02-2020
F           Canada       03-02-2020
M           Canada       03-02-2020
M           Canada       03-03-2020
F           Canada       03-03-2020
M           Canada       03-04-2020
F           Canada       03-04-2020
M           Canada       03-04-2020

The code I already have

day_first=datetime.date(2020, 3, 1)
day_last=datetime.date(2020, 7, 10)
delta = (day_last - day_first)
print(delta.days)

for i in range(delta.days + 1):
  all_dates = day_first + datetime.timedelta(+i)
  print(all_dates)    # This gives me the range of dates I am looking for. 

date_count=df.groupby('date_report').date_report.count()
sub_df = df.loc[df['date_report'].between(day_first,day_last), :]
date_count = sub_df.groupby('date_report').date_report.count()
ax=date_count.plot(kind='line')
ax.xaxis.set_major_locator(months)
plt.xlabel("March 1/2020 to July 10/2020")
plt.ylabel("Number of Cases")
plt.show()

This creates a plot looking like this:

ATTACHED PHOTO OF PLOT

I just need to calculate the weekly moving average within the same sub-data frame and then plot that on the same graph. Thanks in advance for your help and sorry for the screenshot - I couldn't add the pic another way as I'm new to stackoverflow!

Tom
  • 8,310
  • 2
  • 16
  • 36

1 Answers1

0

I think you are looking to use rolling, which can be used to create moving averages. Here is an example trying to emulate yours, with rolling used to add another curve:

import pandas as pd
import numpy as np

#the range of dates you mention
dr = pd.date_range('3-7-2020','7-10-2020',freq='D')

#df with a date_report column which is 1000 randomly chosen dates from dr
df = pd.DataFrame(np.random.choice(dr, size=1000), columns=['date_report'])
 
#your groupby operation and plotting
counts = df.groupby('date_report').date_report.count()
counts.index = counts.index.date
ax = counts.plot(kind='line')
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))

#now creating a rolling 7 day window and plotting
counts.index = pd.to_datetime(counts.index)
rolling = counts.rolling('7D').mean()
rolling.plot()

enter image description here

I edited to include some date formatting of the x-axis. There may be a better way, but you can see I do some annoying date conversion of counts for this to work (see my other post here)

Tom
  • 8,310
  • 2
  • 16
  • 36
  • Thanks for this! Could you explain why the Y axis only goes up to 16? I don't really understand what this is showing –  Jul 16 '20 at 20:04
  • @akearney17 it is the number of times those dates occurred in the DataFrame, same as your `date_count=df.groupby('date_report').date_report.count()`. But my dates are random, so they don't have any seasonal pattern like yours. The y values max out at ~15 b/c I suppose it was unlikely for any single date to get picked >15 times. If I picked more than 1000 dates the range would increase. – Tom Jul 16 '20 at 20:14
  • Thank you so much! –  Jul 16 '20 at 20:16
  • @akearney17 Of course! Good luck with it. You can accept the answer if it answers your question (or elaborate if not) : D – Tom Jul 16 '20 at 20:20