I have hourly data for each day of the year for 11 years. I want to find the maximum ozone value each day (I have 6 ozone columns), but I want to keep the entire row of data for the hour mark that the maximum value occurred, incluiding the date. I have a lot of meteorological parameters stored in this dataframe (like temperature, wind speed, etc) and don't want the code to drop any of the other columns. I want the code to show me every detail of the line, including the datetime
column, that the maximum ozone value occurs. The important component of this code is that I need the maximum ozone value for any given day, while preserving all other rows of data, including the datetime column, meaning I want to see the year-month-day-hour that the maximum ozone value occurred.
Here's the link to my csv file: https://drive.google.com/file/d/1iyrvbD9gPHoTmwhSxo8aPfgfAIbpOBK6/view?usp=sharing
Here's the code I have tried thus far:
import pandas as pd
df = pd.read_csv('')
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index)
#these are the columns I want to perform the action on:
ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168']
mda8 = df.loc[df.groupby([df.index.year, df.index.month, df.index.day], as_index=False)[ozone].idxmax()]
I have also tried:
df['day'] = df.index.day
df['year'] = df.index.year
df['month'] = df.index.month
df = df.reset_index()
mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)[ozone].idxmax()]
I also tried an individual column:
mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)["MDA8_3135"].max()]
I keep getting an error message that says ValueError: Cannot index with multidimensional key
no matter what method I use. I assume it's because of the groupby, but someone has done it before and got it to work:
Get the row(s) which have the max value in groups using groupby
EDIT
DATA SAMPLE:
df = pd.DataFrame({'date':['2009-01-01 00:00:00',
'2009-01-01 01:00:00',
'2009-01-01 02:00:00',
'2009-01-01 03:00:00',
'2009-01-02 04:00:00',
'2009-01-02 05:00:00',
'2009-01-02 06:00:00',
'2009-01-02 07:00:00',
'2009-01-03 08:00:00',
'2009-01-03 09:00:00'],
'nox_3135(ppb)':[20,29,27,31,33,14,34,23,32,31],
'CO_3135(ppm)':[0.8,0.9,0.1,0.2,0.5,0.5,0.7,0.9,0.9,0.3],
'MDA8_3135':[42,45,47,51,52,52,57,67,65,70],
'pm25_3135_2018':[6,7,6,7,4,5,2,11,9,18]})
Okay, so notice that there are multiple times for the same day. I want to pull the daily maximum ozone value from MDA8_3135
column, and I want to include all other rows from that specific maximum value.
The new dataframe would look like:
new = pd.DataFrame({'date':['2009-01-01 03:00:00','2009-01-02 07:00:00',
'2009-01-03 09:00:00'],
'nox_3135(ppb)':[31,23,31],
'CO_3135(ppm)':[0.2,0.9,0.3],
'MDA8_3135':[51,67,70],
'pm25_3135_2018':[7,11,18]})
In my actual dataframe, I have 6 MDA8 columns, so I want to perform the code on all of these columns INDIVIDUALLY and create 6 new dataframes with the maximum ozone concentration and all the other columns that correspond to the specific timestamp that the maximum ozone value occurs on a given day.