
So I have a large dataframe of daily temperature data that I've been filtering and working with. The '#Above' column means that at that latitude/longitude point on that day (date not shown, just the Year is retained), the temperature was above a threshold. '#Below' is the same but below the threshold. A run of either is counted as a single event if it lasts 5 days or more. For example, in the first row below, there were days above and below the respective thresholds, but no runs long enough to be counted as events. In the second row, however, you will see that 2 events of 5 days each correspond to 2 separate '#Mevents', 5+5=10 '#Mdays', and in 'All Events' they are listed as two 5's.

    Lat     Lon     Year    #Above  #Mdays   #Mevents   #Below  #Cdays    #Cevents    All Events
44  24.125  262.375 1983    1       0        0          -4       0        0           []
45  24.125  262.375 1985    12      10       2          -25     -20       2           [ 5, -7, -13, 5]  
46  24.125  262.375 1986    30      24       2          -16     -5        1           [12, 12, -5]
47  24.125  262.375 1987    5       0        0          -41     -26       2           [-26, -8]
... ... ... ... ... ... ... ... ... ... ...

Here is the loop that I've been using. (PS it takes 1.5 hours to run, any optimization tricks let me know >_>)
There are ~500 points, 365 days*40 years worth of data so it ends up being a lot.

import numpy as np
import pandas as pd

more_data = pd.DataFrame([])
for iii in cdo_gom.lat.values:
    for jjj in cdo_gom.lon.values:
        for yyyy in np.arange(1981, 2022):
            point = master_df.loc[(master_df['lat'] == iii) & (master_df['lon'] == jjj)]
            year = point.loc[(point['year'] == yyyy)]
            a = np.asarray(year['threshold'].groupby(year['threshold'].diff().ne(0).cumsum()).cumsum())
            MHW = np.count_nonzero(a >= 5)
            above = np.count_nonzero(a >= 1)
            CS = np.count_nonzero(a <= -5)
            below = np.count_nonzero(a <= -1)

            events = a[np.argwhere(np.abs(a-np.roll(a,-1)) >= 5)].reshape(-1)
            hw_events = np.sum(np.array(events) >= 0, axis=0)
            cs_events = np.sum(np.array(events) <= 0, axis=0)

            elist = np.array2string(events, separator=', ')

            more_data = more_data.append(pd.DataFrame({'Lat' : iii, 
                                                       'Lon' : jjj, 
                                                        'Year' : yyyy,
                                                        '#MHW days' : MHW + (4*hw_events),
                                                        '#MHW events' : hw_events,
                                                        '#CS days' : CS + (4*cs_events),
                                                        '#CS events' : cs_events,
                                                        'Threshold duration Events' : elist}, 
                                                        index=[0]), ignore_index = True)

I've tried different things to get the maximum and minimum of 'All Events', such as maxCS1 = np.min(events), but I keep getting ValueError: zero-size array to reduction operation fmax which has no identity.
Then I tried taking the max/min of each row in the 'All Events' column itself (TypeError: 'float' object is not iterable), and applying max/min to the column (same TypeError).

Should I not be converting from array to string (from the 'events' array to the 'elist' string)? How else could I list all of the events, and get the max/min of each row?

Sorry for any lack of clarity. I will happily answer any questions.

What I want:

    Lat     Lon     Year    #Above  #Mdays   #Mevents   #Below  #Cdays    #Cevents   All Events       MaxM     MaxC 
0   24.125  262.375 1983    1       0        0          -4       0        0          []     
1   24.125  262.375 1985    12      10       2          -25     -20       2          [ 5, -7, -13, 5] 5        -13
2   24.125  262.375 1986    30      24       2          -16     -5        1          [12, 12, -5]     12       -5
3   24.125  262.375 1987    5       0        0          -41     -26       2          [-26, -8]                 -26
... ... ... ... ... ... ... ... ... ... ...
spokati
  • See also: https://stackoverflow.com/questions/27263805/pandas-column-of-lists-create-a-row-for-each-list-element – cbare Jul 14 '21 at 04:17

1 Answer


The answer probably depends on what data type your "All Events" column is. For example,

import pandas as pd

df = pd.DataFrame({"a": ["a", "b", "c"], "events":[
    [7,13,-2],
    [11,-9,12],
    [4,6,8],
]})

type(df.events[0])

...gives you elements of type list. You can directly apply min and max like so:

df.events.apply(min)

0   -2
1   -9
2    4
Name: events, dtype: int64

Pandas represents missing data as NaN, which is a special floating point value. That might be the source of the TypeErrors you saw.

Looking at your code, though, your events look like they're numpy arrays, so you'd have something like this:

import numpy as np

df = pd.DataFrame({"a": ["a", "b", "c", "d"], "events":[
    np.array([7,13,-2]),
    np.array([11,-9,12]),
    np.array([4, 6, 8, 15]),
    float("NaN")
]})

To deal with the possibility of NaNs, you'll want something like:

df["MinEvent"] = df.events.apply(lambda x:min(x) if isinstance(x, np.ndarray) else x)

df

    a   events        MinEvent
0   a   [7, 13, -2]     -2.0
1   b   [11, -9, 12]    -9.0
2   c   [4, 6, 8, 15]   4.0
3   d   NaN             NaN
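A related gotcha: the ValueError about a zero-size array comes from calling np.min or np.max on an empty events array (a year with no events). One way around it, sketched here with the example values from your question, is to select the warm/cold entries with a boolean mask and use Python's built-in min/max with a default:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"events": [
    np.array([5, -7, -13, 5]),   # two warm events, two cold events
    np.array([], dtype=int),     # a year with no events at all
]})

# built-in max/min accept `default`, so empty selections become NaN
# instead of raising like np.max/np.min do
df["MaxM"] = df.events.apply(lambda x: max(x[x > 0], default=np.nan))
df["MaxC"] = df.events.apply(lambda x: min(x[x < 0], default=np.nan))
```

With the second row empty, both MaxM and MaxC come out as NaN rather than raising.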

If you want to check for NaNs, or for missing data more broadly defined, isna (a.k.a. isnull) comes in handy:

df.events.isnull().value_counts()

False    3
True     1
Name: events, dtype: int64

Finally, building up a DataFrame by appending row by row is really slow, because each append copies the whole frame. It'll be much faster to create whole columns at once, or to use the form of the constructor that takes all the rows and a list of column names: pd.DataFrame(list_of_rows, columns=['A', 'B', 'C', ...])
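For instance, a minimal sketch of the list-of-rows pattern, with made-up values standing in for your per-year computations:

```python
import numpy as np
import pandas as pd

rows = []
for yyyy in range(1981, 1984):          # stand-in for the lat/lon/year loops
    events = np.array([5, -7])          # stand-in for the computed events
    rows.append({
        "Year": yyyy,
        "#MHW events": int(np.sum(events >= 5)),
        "All Events": events.tolist(),  # keep a list, not a string
    })

# one constructor call at the end instead of repeated appends
more_data = pd.DataFrame(rows)
```

Keeping "All Events" as a list also means the min/max tricks above work directly on the column later.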

cbare