Subsetting out rows using pandas

Question

I have two sets of dataframes: datamax, datamax2015 and datamin, datamin2015.

Snippet of data:

print(datamax.head())
print(datamin.head())
print(datamax2015.head())
print(datamin2015.head())

Date           ID Element  Data_Value
0  2005-01-01  USW00094889    TMAX         156
1  2005-01-02  USW00094889    TMAX         139
2  2005-01-03  USW00094889    TMAX         133
3  2005-01-04  USW00094889    TMAX          39
4  2005-01-05  USW00094889    TMAX          33
         Date           ID Element  Data_Value
0  2005-01-01  USC00200032    TMIN         -56
1  2005-01-02  USC00200032    TMIN         -56
2  2005-01-03  USC00200032    TMIN           0
3  2005-01-04  USC00200032    TMIN         -39
4  2005-01-05  USC00200032    TMIN         -94
         Date           ID Element  Data_Value
0  2015-01-01  USW00094889    TMAX          11
1  2015-01-02  USW00094889    TMAX          39
2  2015-01-03  USW00014853    TMAX          39
3  2015-01-04  USW00094889    TMAX          44
4  2015-01-05  USW00094889    TMAX          28
         Date           ID Element  Data_Value
0  2015-01-01  USC00200032    TMIN        -133
1  2015-01-02  USC00200032    TMIN        -122
2  2015-01-03  USC00200032    TMIN         -67
3  2015-01-04  USC00200032    TMIN         -88
4  2015-01-05  USC00200032    TMIN        -155

For datamax, datamax2015, I want to compare their Data_Value columns and create a dataframe of entries in datamax2015 whose Data_Value is greater than all entries in datamax for the same day of the year. Thus, the expected output should be a dataframe with rows from 2015-01-01 to 2015-12-31 but with dates only where the values in the Data_Value column are greater than those in the Data_Value column of the datamax dataframe.

i.e 4 rows and anything from 1 to 364 columns depending on the condition above.

I want the converse (min) for the datamin and datamin2015 dataframes.

I have tried the following code:

upper = []
for row in datamax.iterrows():
    for j in datamax2015["Data_Value"]:
        if j > row["Data_Value"]:
            upper.append(row)
lower = []
for row in datamin.iterrows():
    for j in datamin2015["Data_Value"]:
        if j < row["Data_Value"]:
            lower.append(row)

Could anyone give me a helping hand as to where I am going wrong?

Can you post a snippet from the dataframes (datamin, datamin2015 and datamax, datamax2015)? In particular, do the data frames in each of the two pairs always have the same number of rows? — Jonas, May 30 '20 at 15:52
No, one has dates 2005-2014, the other only dates 2015. And yes will do above. — Caledonian26, May 30 '20 at 15:52
Thanks. Can you please also provide the expected output for the example dataframe you posted? It's unclear to me which rows in datamax should have their `Data_Value` entry compared to which rows in datamax2015. - Those with the same day of year? — Jonas, May 30 '20 at 15:56
Ok, thanks. Are you comparing to the min across all years 2005-2014 for that day of the year in datamin? — Jonas, May 30 '20 at 16:08
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/214991/discussion-between-jonas-and-caledonian26). — Jonas, May 30 '20 at 16:26

Jonas · Accepted Answer · 2020-05-30T16:17:57.877

2

This code does what you want for the datamin. Try to adapt it to the datamax symmetric case as well - leave a comment if you have trouble and happy to help further.

Create Data

from datetime import datetime
import pandas as pd

datamin = pd.DataFrame({"date": pd.date_range(start=datetime(2005, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 1})
datamin["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year higher in order for the desired result to be non-empty
datamin.loc[datamin["day_of_year"]==4, "Data_Value"] = 2 

datamin2015 = pd.DataFrame({"date": pd.date_range(start=datetime(2015, 1, 1), end=datetime(2015, 12, 31)), "Data_Value": 2})
datamin2015["day_of_year"] = datamin["date"].dt.dayofyear
# Set the value for the 4th day of the year lower in order for the desired result to be non-empty
datamin2015.loc[3, "Data_Value"] = 1

The solution

df1 = datamin.groupby("day_of_year").agg({"Data_Value": "min"})
df2 = datamin2015.join(df1, on="day_of_year", how="left", lsuffix="2015")
lower = df2.loc[df2["Data_Value2015"]<df2["Data_Value"]]
lower

We group the datamin by day of year to find the min across all the years for each day of the year (using .dt.dayofyear). Then we join that with datamin2015 and finally can then compare the Data_Value2015 with Data_Value in order to find the indexes of the rows where the Data_Value in 2015 was less than the minimum across all same days of the year in datamin.

In the example above lower has 1 row by the way I set up the dataframes.

edited May 30 '20 at 16:17

answered May 30 '20 at 16:12

Jonas

1,473
2
13
28

why do we need to set the value for the 4th day of the year higher in order for the desired result to be non-empty? – Caledonian26 May 30 '20 at 16:19
You don't have to do this in your code - I just did it to easily generate an example to demonstrate :) – Jonas May 30 '20 at 16:20
Can only use .dt accessor with datetimelike values: I now get this error? – Caledonian26 May 30 '20 at 16:22
Then you probably need to convert your `Date` column: https://stackoverflow.com/questions/17134716/convert-dataframe-column-type-from-string-to-datetime-dd-mm-yyyy-format – Jonas May 30 '20 at 16:28
yes done that, but I still get an error: KeyError: 'Date' – Caledonian26 May 30 '20 at 16:42
why are we doing a left join? – Caledonian26 May 30 '20 at 19:08
Doing a left join will ensure that only dates in 2015 are in the lower or upper dataframes. This usually doesn't matter, except if datamin contains leap years but datamin2015 doesn't (which is the case here). Using join="left" excludes those leap days from the result. – Jonas May 30 '20 at 19:22
can we have a private chat, because I feel my code is just right but the program is not accepting it. On visual inspection, my code doesn't seem to have any errors but it keeps giving me the error above... – Caledonian26 May 30 '20 at 19:34
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/215003/discussion-between-jonas-and-caledonian26). – Jonas May 30 '20 at 21:39
1

see answer below! – Caledonian26 May 31 '20 at 15:17

score 0 · Answer 2 · answered May 31 '20 at 15:17

Python code which returns a line graph of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
Overlay a scatter of the 2015 data for any points (highs and lows) for which the ten year record (2005-2014) record high or record low was broken in 2015.

Remove leap year dates (i.e. 29th February).

from datetime import datetime
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option("display.max_rows",None,"display.max_columns",None)
data = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv') 
newdata = data[(data['Date'] >= '2005-01-01') & (data['Date'] <= '2014-12-12')]
datamax = newdata[newdata['Element']=='TMAX']
datamin = newdata[newdata['Element']=='TMIN']
datamax['Date'] = pd.to_datetime(datamax['Date'])
datamin['Date'] = pd.to_datetime(datamin['Date'])
datamax["day_of_year"] = datamax["Date"].dt.dayofyear
datamax = datamax.groupby('day_of_year').max()
datamin["day_of_year"] = datamin["Date"].dt.dayofyear
datamin = datamin.groupby('day_of_year').min()
datamax = datamax.reset_index()
datamin = datamin.reset_index()
datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d')
datamax['Date'] = datamax['Date'].dt.strftime('%Y-%m-%d')
datamax = datamax[~datamax['Date'].str.contains("02-29")]
datamin = datamin[~datamin['Date'].str.contains("02-29")]

breakoutdata = data[(data['Date']  > '2014-12-31')]
datamax2015 = breakoutdata[breakoutdata['Element']=='TMAX']
datamin2015 = breakoutdata[breakoutdata['Element']=='TMIN']
datamax2015['Date'] = pd.to_datetime(datamax2015['Date'])
datamin2015['Date'] = pd.to_datetime(datamin2015['Date'])
datamax2015["day_of_year"] = datamax2015["Date"].dt.dayofyear
datamax2015 = datamax2015.groupby('day_of_year').max()
datamin2015["day_of_year"] = datamin2015["Date"].dt.dayofyear
datamin2015 = datamin2015.groupby('day_of_year').min()
datamax2015 = datamax2015.reset_index()
datamin2015 = datamin2015.reset_index()
datamin2015['Date'] = datamin2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015['Date'] = datamax2015['Date'].dt.strftime('%Y-%m-%d')
datamax2015 = datamax2015[~datamax2015['Date'].str.contains("02-29")]
datamin2015 = datamin2015[~datamin2015['Date'].str.contains("02-29")]

dataminappend = datamin2015.join(datamin,on="day_of_year",rsuffix="_new")
lower = dataminappend.loc[dataminappend["Data_Value_new"]>dataminappend["Data_Value"]]
datamaxappend = datamax2015.join(datamax,on="day_of_year",rsuffix="_new")
upper = datamaxappend.loc[datamaxappend["Data_Value_new"]<datamaxappend["Data_Value"]]

upper['Date'] = pd.to_datetime(upper['Date']) 
lower['Date'] = pd.to_datetime(lower['Date']) 
datamax['Date'] = pd.to_datetime(datamax['Date']) 
datamin['Date'] = pd.to_datetime(datamin['Date']) 

ax = plt.gca()
plt.plot(datamax['day_of_year'],datamax['Data_Value'],color='red')
plt.plot(datamin['day_of_year'],datamin['Data_Value'], color='blue')
plt.scatter(upper['day_of_year'],upper['Data_Value'],color='purple')
plt.scatter(lower['day_of_year'],lower['Data_Value'], color='cyan')

plt.ylabel("Temperature (degrees C)",color='navy')
plt.xlabel("Date",color='navy',labelpad=15)
plt.title('Record high and low temperatures by day (2005-2014)', alpha=1.0,color='brown',y=1.08)
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.35),fancybox=False,labels=['Record high','Record low'])
plt.xticks(rotation=30)
plt.fill_between(range(len(datamax['Date'])), datamax['Data_Value'], datamin['Data_Value'],color='yellow',alpha=0.8)
plt.show()

I have converted the 'Date' column to a string using Datamin['Date'] = datamin['Date'].dt.strftime('%Y-%m-%d').
I have then converted this back to 'datetime' format using upper['Date'] = pd.to_datetime(upper['Date'])
I then used 'date of year' as the x-value.

Subsetting out rows using pandas

2 Answers2

Create Data

The solution