So I have a large dataframe of daily temperature data that I've been filtering and working with. The '#Above' column means at that latitude/longitude point on that day (date not shown, just the Year is retained), the temperature was above a threshold. The '#Below' is the same but below the threshold. An event for either is considered a single event if it lasts for 5 days or more. For example, in the first row below, there were days above and below the respective thresholds but not enough to by called days or events. In the second row, however, you will see that 2 events of 5 days each correspond to 2 separate '#Mevents', 5+5=10 '#Mdays', and in the 'All Events' listed as two 5's.
Lat Lon Year #Above #Mdays #Mevents #Below #Cdays #Cevents All Events
44 24.125 262.375 1983 1 0 0 -4 0 0 []
45 24.125 262.375 1985 12 10 2 -25 -20 2 [ 5, -7, -13, 5]
46 24.125 262.375 1986 30 24 2 -16 -5 1 [12, 12, -5]
47 24.125 262.375 1987 5 0 0 -41 -26 2 [-26, -8]
... ... ... ... ... ... ... ... ... ... ...
Here is the loop that I've been using. (PS it takes 1.5 hours to run, any optimization tricks let me know >_>)
There are ~500 points, 365 days*40 years worth of data so it ends up being a lot.
more_data = pd.DataFrame([])
for iii in cdo_gom.lat.values:
for jjj in cdo_gom.lon.values:
for yyyy in np.arange(1981, 2022):
point = master_df.loc[(master_df['lat'] == iii) & (master_df['lon'] == jjj)]
year = point.loc[(point['year'] == yyyy)]
a = np.asarray(year['threshold'].groupby(year['threshold'].diff().ne(0).cumsum()).cumsum())
MHW = np.count_nonzero(a >= 5)
above = np.count_nonzero(a >= 1)
CS = np.count_nonzero(a <= -5)
below = np.count_nonzero(a <= -1)
events = a[np.argwhere(np.abs(a-np.roll(a,-1)) >= 5)].reshape(-1)
hw_events = np.sum(np.array(events) >= 0, axis=0)
cs_events = np.sum(np.array(events) <= 0, axis=0)
elist = np.array2string(events, separator=', ')
more_data = more_data.append(pd.DataFrame({'Lat' : iii,
'Lon' : jjj,
'Month' : mmm,
'Year' : yyyy,
'#MHW days' : MHW + (4*hw_events),
'#MHW events' : hw_events,
'#CS days' : CS + (4*cs_events),
'#CS events' : cs_events,
'Threshold duration Events' : elist},
index=[0]), ignore_index = True)
I've tried different things to get the maximum and minimum for 'All Events', such as maxCS1 = np.min(events) but I keep getting ValueError: zero-size array to reduction operation fmax which has no identity.
Then I was trying to take the max/mins of each row in the 'All Events' column itself (TypeError: 'float' object is not iterable) or applying max/min to the column (same TypeError).
Should I not be converting from array to string (from 'events' variable to 'elist' variable)? How else could I list all of the events, and the max/min of each row?
Sorry for any lack of clarity. I will happily answer any questions.
What I want:
Lat Lon Year #Above #Mdays #Mevents #Below #Cdays #Cevents All Events MaxM MaxC
0 24.125 262.375 1983 1 0 0 -4 0 0 []
1 24.125 262.375 1985 12 10 2 -25 -20 2 [ 5, -7, -13, 5] 5 -13
2 24.125 262.375 1986 30 24 2 -16 -5 1 [12, 12, -5] 12 -5
3 24.125 262.375 1987 5 0 0 -41 -26 2 [-26, -8] -26
... ... ... ... ... ... ... ... ... ... ...