1

I'm trying to get used to using datetime data in Pandas and plotting different comparisons for a given dataset. I'm using the London Air Quality dataset for Ozone to practice and am trying to replicate the chart below (that I've created using a pivot table in Excel) with Pandas and matplotlib.

The chart plots an average of each hours Ozone reading for each location across the entire dataset to see if there is one location which is constantly higher than others or if different locations have the highest Ozone levels at different periods throughout the day.

Essentially, I'm looking to plot the hourly average of Ozone for each location.

I've attempted to reshape the data into a multiindex format and then plot, similar to what I'd do in excel before plotting but am unsure if this is the correct way to approach the problem. Code for reshaping is below. I am still getting used to reshaping so not sure if this is the correct use/I am approaching the problem in the correct way and open to other methods to accomplish this task. Any assistance to accomplish this task would be much appreciated!

enter image description here

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

data = pd.read_csv('/Users/xx/Downloads/LaqnData.csv')

data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])

data['Date'] = pd.to_datetime(data['ReadingDateTime']).dt.date
data['Time'] = pd.to_datetime(data['ReadingDateTime']).dt.time

data.set_index(['Date', 'Time'], inplace = True)

hourly_dataframe = data.pivot_table(columns = 'Site', values = 'Value', index = ['Date', 'Time'])

hourly_dataframe.fillna(method = 'ffill', inplace = True)
hourly_dataframe[hourly_dataframe < 0] = 0
CJ90
  • 99
  • 1
  • 10

1 Answers1

1

I have gone to the site and downloaded a 24 hour reading for the following;

data.Site.unique()

array(['BX1', 'TH4', 'BT4', 'HI0', 'BL0', 'RD0'], dtype=object)

I adopted your code to this point:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

data = pd.read_csv('/Users/xx/Downloads/LaqnData.csv')

data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])

I then use datetime index to call each hour in the groupby function.

data.groupby([data.index.hour, data['Site']])['Value'].mean().reset_index()`#Convert to dataframe.`

To plot, I chain unstack to the groupby function and plot directly.

data.groupby([data.index.hour, data['Site']])['Value'].mean().reset_index#unstack().plot()
plt.xlabel('Hour of the day')
plt.ylabel('Ozone')
plt.title('Avarage Hourly comparison')
plt.legend()`# If you want the legend to appear in default location`

enter image description here If fussed about legend location, this post explains it very well. In your case;

plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15),
          fancybox=True, shadow=True, ncol=6)

enter image description here

wwnde
  • 26,119
  • 6
  • 18
  • 32
  • when I run the code as you have written it I get the error 'AttributeError: 'RangeIndex' object has no attribute 'hour'' - I thought this was referencing the fact the index which was currently set had no 'hour' element, so i tried to set the index to datetime format using 'data.set_index('ReadingDateTime', inplace = True)' which gives me the output I would like in a tabular format, but then I am confused on the unstacking. Could you please elaborate? - I get the error 'No handles with labels found to put in legend.' if running as you have written above – CJ90 Mar 06 '20 at 01:46
  • I have gone back and rerun the code on a different machine and it works. Please set your ReadingDateTime to datetime before setting index as shown below; data['ReadingDateTime'] = pd.to_datetime(data['ReadingDateTime'])#Converting to datetime Proceed and check it has been set as datetime64[NS] BY CALLING: data.dtypes#Checking datatpes for all columns – wwnde Mar 06 '20 at 02:07
  • Stacking tiles up columns . Unstacking flatten s the dataframe. Consider plotting a stacked dataframe , pivot table with multi-layered columns (2d); data.groupby([data.index.hour, data['Site']])['Value'].mean() versus plotting an unstacked one with a single layer of columns data.groupby([data.index.hour, data['Site']])['Value'].mean().unstack() – wwnde Mar 06 '20 at 02:24
  • Once I set the index and used the unstack() function correctly it worked and I have accepted the answer, thanks! Can you explain why we need to reset_index() when using the groupby() function in this line - data.groupby([data.index.hour, data['Site']])['Value'].mean().reset_index() – CJ90 Mar 06 '20 at 03:36
  • 1
    A groupby function gives us a series. when reset_index() is chained to groupby, it converts the series into a dataframe. The other way to do this is to chain pandas.Series.to_frame¶ to the groupby function – wwnde Mar 06 '20 at 04:02