1

I have hourly data for each day of the year for 11 years. I want to find the maximum ozone value each day (I have 6 ozone columns), but I want to keep the entire row of data for the hour mark that the maximum value occurred, incluiding the date. I have a lot of meteorological parameters stored in this dataframe (like temperature, wind speed, etc) and don't want the code to drop any of the other columns. I want the code to show me every detail of the line, including the datetime column, that the maximum ozone value occurs. The important component of this code is that I need the maximum ozone value for any given day, while preserving all other rows of data, including the datetime column, meaning I want to see the year-month-day-hour that the maximum ozone value occurred.

Here's the link to my csv file: https://drive.google.com/file/d/1iyrvbD9gPHoTmwhSxo8aPfgfAIbpOBK6/view?usp=sharing

Here's the code I have tried thus far:

import pandas as pd

df = pd.read_csv('')
df.set_index('date', inplace=True)
df.index = pd.to_datetime(df.index)

#these are the columns I want to perform the action on:
ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168'] 

mda8 = df.loc[df.groupby([df.index.year, df.index.month, df.index.day], as_index=False)[ozone].idxmax()]

I have also tried:

df['day'] = df.index.day
df['year'] = df.index.year
df['month'] = df.index.month
df = df.reset_index()
mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)[ozone].idxmax()]

I also tried an individual column:

mda8 = df.loc[df.groupby(['year', 'month', 'day'], as_index=False)["MDA8_3135"].max()]

I keep getting an error message that says ValueError: Cannot index with multidimensional key no matter what method I use. I assume it's because of the groupby, but someone has done it before and got it to work: Get the row(s) which have the max value in groups using groupby

EDIT

DATA SAMPLE:

df = pd.DataFrame({'date':['2009-01-01 00:00:00', 
'2009-01-01 01:00:00', 
'2009-01-01 02:00:00', 
'2009-01-01 03:00:00', 
'2009-01-02 04:00:00', 
'2009-01-02 05:00:00', 
'2009-01-02 06:00:00', 
'2009-01-02 07:00:00', 
'2009-01-03 08:00:00', 
'2009-01-03 09:00:00'], 
'nox_3135(ppb)':[20,29,27,31,33,14,34,23,32,31],
'CO_3135(ppm)':[0.8,0.9,0.1,0.2,0.5,0.5,0.7,0.9,0.9,0.3],
'MDA8_3135':[42,45,47,51,52,52,57,67,65,70],
'pm25_3135_2018':[6,7,6,7,4,5,2,11,9,18]})

Okay, so notice that there are multiple times for the same day. I want to pull the daily maximum ozone value from MDA8_3135 column, and I want to include all other rows from that specific maximum value. The new dataframe would look like:

new = pd.DataFrame({'date':['2009-01-01 03:00:00','2009-01-02 07:00:00', 
'2009-01-03 09:00:00'], 
'nox_3135(ppb)':[31,23,31],
'CO_3135(ppm)':[0.2,0.9,0.3],
'MDA8_3135':[51,67,70],
'pm25_3135_2018':[7,11,18]})

In my actual dataframe, I have 6 MDA8 columns, so I want to perform the code on all of these columns INDIVIDUALLY and create 6 new dataframes with the maximum ozone concentration and all the other columns that correspond to the specific timestamp that the maximum ozone value occurs on a given day.

Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
obscuredbyclouds
  • 189
  • 1
  • 1
  • 15

1 Answers1

1

Based on your clarification that you want 6 separate dataframes, compute them in a comprehension.

For a given ozone column, use groupby.idxmax to find the date index of the max ozone value. Since some dates are NaT, dropna before indexing the matching rows with loc:

ozone = ['MDA8_3135', 'MDA8_2551', 'MDA8_3186', 'MDA8_2878', 'MDA8_2199', 'MDA8_3168'] 
df = pd.read_csv('IV_hourly_dataframe_Clim_Anom.csv', parse_dates=['date'], index_col=['date'])
out = {col: df.loc[df[col].groupby(df.index.date).idxmax().dropna()] for col in ozone}

Output using IV_hourly_dataframe_Clim_Anom.csv on pandas 1.3.3 and python 3.9.7:

>>> out['MDA8_3135']
#                      T_3186  T_5408  ...  MDA8_3135  ...  CO_3135_da8_anom  CO_2199_da8_anom
# date
# 2009-01-01 11:00:00    16.0     NaN  ...    35.0000  ...          1.096090          0.004785
# 2009-01-02 12:00:00    18.0     NaN  ...    32.1250  ...          0.258590          0.117285
# ...                     ...     ...  ...        ...  ...               ...               ...
# 2019-12-31 12:00:00    20.3   18.35  ...    37.2000  ...         -0.616592         -0.079884
>>> out['MDA8_2551']
#                      T_3186  T_5408  ...  MDA8_2551  ...  CO_3135_da8_anom  CO_2199_da8_anom
# date
# 2009-01-01 12:00:00    17.0     NaN  ...    34.2500  ...          0.383590          0.042285
# 2009-01-02 12:00:00    18.0     NaN  ...    28.7500  ...          0.258590          0.117285
# ...                     ...     ...  ...        ...  ...               ...               ...
# 2019-12-31 12:00:00    20.3   18.35  ...    32.7875  ...         -0.616592         -0.079884

...

>>> out['MDA8_3168']
#                      T_3186  T_5408  ...  MDA8_3168  ...  CO_3135_da8_anom  CO_2199_da8_anom
# date
# 2009-01-01 12:00:00    17.0     NaN  ...  39.250000  ...          0.383590          0.042285
# 2009-01-02 14:00:00    19.0     NaN  ...  58.000000  ...          0.508590          0.117285
# ...                     ...     ...  ...        ...  ...               ...               ...
# 2019-12-31 11:00:00    19.6   16.83  ...  44.187500  ...         -0.620967         -0.084234
tdy
  • 36,675
  • 19
  • 86
  • 83
  • Two things, maybe I wasn't clear in the op, but I need to create separate dataframes each time I run this code for different ozone columns. Your code above pulls the maximum for MDA8_2551 on 2009-01-01 but not for MDA8_3135. Doing them in one swoop wouldn't work, sorry for confusion. Also, I want to preserve the hour that this value occurred. I tried to run your code with both a single ozone column and two ozone columns but get ```ValueError: attempt to get argmax of an empty sequence```. I want the maximum ozone value, not the maximum time value. I want to preserve the date column. – obscuredbyclouds Oct 08 '21 at 16:58
  • re:error, what's your `pd.__version__`? i ran my code on your [IV_hourly_dataframe_Clim_Anom.csv](https://drive.google.com/file/d/1iyrvbD9gPHoTmwhSxo8aPfgfAIbpOBK6/view?usp=sharing) file and it worked fine (as in no errors, i realize now the output isn't actually what you wanted), so i'm guessing you have an older pandas version – tdy Oct 08 '21 at 17:13
  • I just updated it to the latest version and am still getting the same error message. – obscuredbyclouds Oct 08 '21 at 17:32
  • @Heather see if the updated answer is what you're looking for – tdy Oct 08 '21 at 22:43