
I have been trying to produce a plot for hours and seem to be going in the wrong direction, but I can't figure out what's wrong.

Dataset: https://www.kaggle.com/asauve/cdc-us-births-data-19692008

I would like to find the "mean daily births by date", grouped by month.

Here is what I am doing.

First I calculate the daily mean:

dailyMean = dataFrame.groupby(['year','month','day'])['births'].mean().reset_index()


Then I am plotting the result

plt.figure(figsize=(15,7))
plt.plot(dailyMean.iloc[:,1],dailyMean.iloc[:,3])
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
              ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.ylabel("mean births")
plt.xlabel("months")
plt.show()

This is what I got

[image: plot produced by the code above]

I also tried seaborn lineplot and this is what I got.

[image: seaborn lineplot result]

But I am expecting something like this: [image: expected plot]

What am I doing wrong here?

Jason
  • It may be useful to use [datetime](https://stackoverflow.com/questions/17978092/combine-date-and-time-columns-using-python-pandas) – Life is Good Jan 25 '21 at 14:54
  • Actually, I tried that too. I converted the year, month and day to datetime but I got a different result. – Jason Jan 25 '21 at 14:58
  • Do not include data/code/error messages as images. Post the text directly here on SO. – Mr. T Jan 25 '21 at 15:43
  • Sorry, I just edited the question and added a link to the dataset. – Jason Jan 25 '21 at 16:57

2 Answers


I think the primary issue is that your groupby doesn't reflect the grouping shown in the desired plot. Including the 'year' column means you keep separate dates from different years. But the plot looks like it is trying to show what an average year looks like; i.e. dates from different years should be binned together.

So I would instead try just grouping by month and day:

dailyMean = df.groupby(['month','day'])['births'].mean().reset_index()

#    month  day    births
# 0      1  1.0  4009.225
# 1      1  2.0  4247.400
# 2      1  3.0  4500.900
# 3      1  4.0  4571.350
# 4      1  5.0  4603.625
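
To make the difference between the two groupings concrete, here is a minimal sketch on made-up data. The `toy` frame and its values are hypothetical and are not taken from the births dataset.

```python
import pandas as pd

# Hypothetical toy data: the same three January dates in two different years.
toy = pd.DataFrame({
    "year":   [1969, 1969, 1969, 1970, 1970, 1970],
    "month":  [1, 1, 1, 1, 1, 1],
    "day":    [1, 2, 3, 1, 2, 3],
    "births": [4000, 4200, 4400, 4100, 4300, 4500],
})

# Grouping by year as well keeps every calendar date separate,
# so each "mean" is just the single original value.
per_date = toy.groupby(["year", "month", "day"])["births"].mean().reset_index()

# Dropping year pools the same month/day across all years.
per_day_of_year = toy.groupby(["month", "day"])["births"].mean().reset_index()

print(len(per_date))                       # 6 rows, one per date
print(len(per_day_of_year))                # 3 rows, one per day of the year
print(per_day_of_year["births"].tolist())  # [4050.0, 4250.0, 4450.0]
```

With the year included, nothing is actually averaged; without it, each row is the mean over both years.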

And then (I agree with the others here) converting to timestamps is probably the most flexible option for plotting. You can build timestamps from your data using an arbitrary year (I chose 2000 because it is a leap year, so February 29th is a valid date):

dates = pd.to_datetime(dict(year=2000, month=dailyMean['month'], day=dailyMean['day']))
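
As a rough illustration of why the placeholder year should be a leap year, the sketch below builds timestamps for a few made-up month/day pairs. The `parts` frame is hypothetical, and it passes a DataFrame with `year`, `month` and `day` columns to `pd.to_datetime` rather than a dict.

```python
import pandas as pd

# Hypothetical month/day pairs, including February 29th.
parts = pd.DataFrame({"month": [2, 2, 3], "day": [28, 29, 1]})

# In a leap year such as 2000, all three rows are valid dates.
leap = pd.to_datetime(parts.assign(year=2000))

# In a non-leap year such as 2001, February 29th does not exist;
# errors="coerce" turns it into NaT instead of raising an error.
non_leap = pd.to_datetime(parts.assign(year=2001), errors="coerce")

print(leap.isna().sum())      # 0
print(non_leap.isna().sum())  # 1
```

Without `errors="coerce"`, the non-leap-year call would raise an error on the February 29th row.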

Then you can just plot the births against the dates, and use matplotlib.dates to do a little additional formatting:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
fig, ax = plt.subplots()
ax.plot(dates, dailyMean['births'])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.set_xlim(dates.iloc[0], dates.iloc[-1])

[image: resulting plot]

Note that this looks a little different from your expected graph; I think the main difference is that mine includes February 29th and yours apparently doesn't. It looked like your data came from here. I had to do some cleaning, and you may have cleaned yours differently. Here is my full example:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import pandas as pd


# load the data from https://www.kaggle.com/asauve/cdc-us-births-data-19692008
df = pd.read_csv('births.csv')


# get what each day of the year looks like on average
dailyMean = df.groupby(['month','day'])['births'].mean().reset_index()

# clean data by removing rows that aren't real dates
dr = pd.date_range('01-01-2020', '12-31-2020', freq='D')
realdates = pd.Series(tuple(zip(dr.month, dr.day)))
dfdates = pd.Series(tuple(zip(dailyMean['month'], dailyMean['day'].astype(int))))
dailyMean = dailyMean[dfdates.isin(realdates)]

# make a new dates series for plotting
dates = pd.to_datetime(dict(year=2000, month=dailyMean['month'], day=dailyMean['day']))

# plot
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before matplotlib 3.6
fig, ax = plt.subplots()
ax.plot(dates, dailyMean['births'])
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.set_xlim(dates.iloc[0], dates.iloc[-1])
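
To show what the cleaning step above is doing, here is a small sketch on made-up rows. The `daily` frame is a hypothetical stand-in for `dailyMean`, and it uses a plain set lookup rather than the `isin` call from the full example.

```python
import pandas as pd

# Hypothetical stand-in for dailyMean, with two impossible dates
# (February 30th and a placeholder day of 99).
daily = pd.DataFrame({
    "month":  [2, 2, 2, 6],
    "day":    [29, 30, 99, 30],
    "births": [4300.0, 10.0, 5.0, 4700.0],
})

# Every (month, day) pair that exists in a leap year.
dr = pd.date_range("2020-01-01", "2020-12-31", freq="D")
real_dates = set(zip(dr.month, dr.day))

# Keep only rows whose (month, day) pair is a real calendar date.
is_real = [pair in real_dates for pair in zip(daily["month"], daily["day"])]
cleaned = daily[is_real]

print(cleaned["day"].tolist())  # [29, 30]
```

Only February 29th and June 30th survive; the two impossible rows are dropped before the timestamps are built.
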
Tom
  • Thank you so much @tom I have got the lead. Actually, I did try to groupby month and day. I think my problem was that I did not know how to plot the data. The last two code blocks from your solution did the work. Thank you so much once again for helping me understand the issue. – Jason Jan 25 '21 at 16:59

If this is what you want, try pandas' built-in plotting. I tested it on some random, manually created data:

df.groupby(['year','month','day'])['births'].mean().plot(kind='line')

[image: resulting plot]

Shubham Rajput