
I have a data frame of more than 1,000,000 earthquakes. I go down the list of earthquakes, declaring each one in turn as the main shock. Then I run through the DataFrame again, trying to identify any earthquake that might be considered an aftershock based on a set of parameters calculated in the for loop, including magnitude, distance from the main shock, and time after the main shock.

Right now the code works fine, but it's been about 4 days and it still isn't done. Each iteration takes about 3 minutes, the larger earthquakes take about 10-15 minutes, and the largest earthquake took about 6 hours (mainly because so many earthquakes were declared aftershocks).

So here I have a function that calculates the rupture length of an earthquake (aftershocks have to be within a radius equal to the rupture length of the main shock); see subsurface_rupture_length:

def subsurface_rupture_length(magnitude):
    srl = 10**((magnitude - 4.38)/1.49)
    return(srl)
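
For what it's worth, since this is just arithmetic, the same formula also works element-wise on a whole pandas Series. A minimal sketch with made-up magnitudes (not from my catalog):

import pandas as pd

# Made-up example magnitudes; the formula broadcasts over the whole Series at once.
magnitudes = pd.Series([5.0, 6.5, 7.2])
srl = 10**((magnitudes - 4.38)/1.49)
print(srl)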

Then a function to calculate the difference in days between the main shock and an aftershock (see days_calc):

def days_calc(aftershock_time,mainshock_time):
    days = (aftershock_time - mainshock_time).days
    return(days)
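
The same difference can also be computed for many events at once when the times sit in a datetime column; a minimal sketch with made-up timestamps:

import pandas as pd

# Made-up event times; subtracting a single Timestamp from a datetime64 Series
# gives a timedelta64 Series, and .dt.days extracts the whole-day difference per row.
event_times = pd.to_datetime(pd.Series(['1983-01-02 03:00', '1983-01-10 12:00']))
mainshock_time = pd.Timestamp('1983-01-01 00:36:58')
days = (event_times - mainshock_time).dt.days
print(days)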

And then a function for the magnitude difference between the main shock and the aftershock (see magnitude_differential); it's important to my research, so this does need to stay:

def magnitude_differential(mainshock_magnitude,aftershock_magnitude):
    mag_diff = mainshock_magnitude - aftershock_magnitude
    return(mag_diff)

Then here's the main bulk of the code. It's just a for loop within a for loop, and it's kind of a mess, but I can't think of anything better at this point. It takes a series of input parameters:

def aftershock_search(df,minimum_depth,maximum_depth,minimum_mainshock_magnitude,minimum_aftershock_magnitude,srl_multiplier,minimum_days):
    df1 = pd.read_csv(df,sep = ',')
    df2 = df1.sort_values(by = ['Magnitude'],ascending = False)
    df3 = df2.loc[(df2['Depth'] >= minimum_depth) & (df2['Depth'] < maximum_depth) & (df2['Magnitude'] >= minimum_aftershock_magnitude)]
    df3['Datetime'] = pd.to_datetime(df3[['Year','Month','Day','Hour','Minute','Second']])
    df3.reset_index(drop = True,inplace =  True)
    M_magnitude,M_datetime,M_latitude,M_longitude,M_depth,A_magnitude,A_datetime,A_latitude,A_longitude,A_depth,A_count,delta_m = [],[],[],[],[],[],[],[],[],[],[],[]
    for main,M in df3.iterrows():
        if M['Magnitude'] >= minimum_mainshock_magnitude:
            A_mag,A_dat,A_lat,A_lon,A_dep,mag_diff = [],[],[],[],[],[]
            srl = subsurface_rupture_length(M['Magnitude']) * srl_multiplier
            a = 0
            for after,A in df3.iterrows():
                d_m = magnitude_differential(M['Magnitude'],A['Magnitude'])
                if d_m > 0:
                    days = days_calc(A['Datetime'],M['Datetime'])
                    if days > 0 and days <= minimum_days:
                        distance = great_circle((M['Latitude'],M['Longitude']),(A['Latitude'],A['Longitude'])).km
                        if distance <= srl:
                            a+=1
                            A_mag.append(A['Magnitude'])
                            A_dat.append(A['Datetime'])
                            A_lat.append(A['Latitude'])
                            A_lon.append(A['Longitude'])
                            A_dep.append(A['Depth'])
                            mag_diff.append(d_m)
                            df3.drop(index = after,inplace = True)
                        else: continue
                    else: continue
                else: continue
            M_magnitude.append(M['Magnitude'])
            M_datetime.append(M['Datetime'])
            M_latitude.append(M['Latitude'])
            M_longitude.append(M['Longitude'])
            M_depth.append(M['Depth'])
            A_count.append(a)
            if a > 0:
                A_magnitude.append(A_mag[0])
                A_datetime.append(A_dat[0])
                A_latitude.append(A_lat[0])
                A_longitude.append(A_lon[0])
                A_depth.append(A_dep[0])
                delta_m.append(mag_diff[0])
            elif a == 0:
                A_magnitude.append('N/A')
                A_datetime.append('N/A')
                A_latitude.append('N/A')
                A_longitude.append('N/A')
                A_depth.append('N/A')
                delta_m.append('N/A')
        else: break
    
    Aftershocks = pd.DataFrame()
    Aftershocks['Mainshock Magnitude'] = M_magnitude
    Aftershocks['Mainshock Datetime'] = M_datetime
    Aftershocks['Mainshock Latitude'] = M_latitude
    Aftershocks['Mainshock Longitude'] = M_longitude
    Aftershocks['Mainshock Depth'] = M_depth
    Aftershocks['Aftershock Count'] = A_count
    Aftershocks['Largest Aftershock Magnitude'] = A_magnitude
    Aftershocks['Largest Aftershock Datetime'] = A_datetime
    Aftershocks['Largest Aftershock Latitude'] = A_latitude
    Aftershocks['Largest Aftershock Longitude'] = A_longitude
    Aftershocks['Largest Aftershock Depth'] = A_depth
    Aftershocks['Delta M'] = delta_m
    return(Aftershocks)

I just plug the whole catalog into this function (the df argument is actually the path to the CSV, which gets read inside).

At the end I bundle everything up into a DataFrame and save it as a CSV later.
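
For context, this is roughly how I call it and save the result; the threshold values below are placeholders, not my actual settings:

Aftershocks = aftershock_search('earthquake_catalog.csv',  # placeholder path to the full catalog
                                minimum_depth = 0,
                                maximum_depth = 70,
                                minimum_mainshock_magnitude = 6.0,
                                minimum_aftershock_magnitude = 2.5,
                                srl_multiplier = 1.0,
                                minimum_days = 365)
Aftershocks.to_csv('aftershock_results.csv', index = False)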

I would love some help trying to speed this up!

Thanks!

Update: I've added a sample of the earthquake catalog that I am using:

Header Year Month Day Hour Minute Second Latitude Longitude Depth Magnitude Magnitude Type
J 1983 1 1 0 36 58.4 33.778333333333336 139.35333333333332 21.0 3.9 V
J 1983 1 1 0 44 37.3 33.891666666666666 139.41833333333332 0.0 3.2 V
J 1983 1 1 1 5 51.6 33.91833333333334 139.43666666666667 17.0 3.7 V
J 1983 1 1 1 18 54.0 33.78 139.41 16.0 2.8 V
J 1983 1 1 1 21 54.4 33.79833333333333 139.335 9.0 2.9 V
J 1983 1 1 1 26 26.3 33.77333333333333 139.41833333333332 20.0 3.4 V
J 1983 1 1 1 30 26.6 33.8 139.34 0.0 0.0
J 1983 1 1 1 36 26.6 33.778333333333336 139.43666666666667 30.0 3.0 V
J 1983 1 1 1 47 47.5 33.785 139.33333333333334 11.0 2.6 V
J 1983 1 1 1 51 48.3 33.763333333333335 139.39 15.0 3.0 V
J 1983 1 1 1 53 13.7 33.778333333333336 139.37666666666667 19.0 3.1 V
J 1983 1 1 1 54 3.1 33.79 139.375 19.0 4.1 J
J 1983 1 1 2 18 30.9 33.76166666666667 139.34666666666666 17.0 3.7 V
J 1983 1 1 2 19 46.9 33.788333333333334 139.37333333333333 26.0 3.3 V
J 1983 1 1 2 26 8.0 33.846666666666664 139.51833333333335 41.0 2.5 V
J 1983 1 1 2 55 7.2 36.41833333333334 141.12833333333333 44.0 2.8 V
J 1983 1 1 2 55 24.0 33.79833333333333 139.315 14.0 2.5 V
J 1983 1 1 2 56 50.4 33.803333333333335 139.385 23.0 2.7 V
J 1983 1 1 3 3 3.8 33.638333333333335 139.66333333333333 25.0 4.0 V
J 1983 1 1 3 15 4.6 33.79333333333334 139.39166666666668 29.0 2.5 V
J 1983 1 1 3 17 14.3 33.803333333333335 139.30833333333334 0.0 3.2 V
J 1983 1 1 3 19 7.7 33.751666666666665 139.315 4.0 2.9 V
J 1983 1 1 3 31 17.2 33.77166666666667 139.32 4.0 2.9 V
J 1983 1 1 3 41 1.1 33.78333333333333 139.31666666666666 0.0 2.6 V
J 1983 1 1 3 49 15.7 33.785 139.33333333333334 11.0 2.5 V
J 1983 1 1 3 50 34.8 33.843333333333334 139.48 35.0 2.8 V
J 1983 1 1 3 52 32.8 33.751666666666665 139.26833333333335 0.0 3.8 V
J 1983 1 1 3 56 9.8 33.846666666666664 139.485 41.0 3.0 V
J 1983 1 1 4 2 22.3 33.83833333333333 139.46833333333333 38.0 3.1 V
J 1983 1 1 4 15 40.7 36.29666666666667 140.95666666666668 45.0 2.2 V
J 1983 1 1 4 27 29.9 33.83833333333333 139.455 36.0 3.0 V
J 1983 1 1 5 23 44.5 33.82333333333333 139.41 29.0 2.8 V
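
In case it helps to reproduce this, here is a rough sketch for reading the whitespace-delimited sample above into pandas; my real catalog is a comma-separated file, so the file name, separator, and column handling here apply to the sample only:

import pandas as pd

# 'sample_catalog.txt' is a placeholder file holding the rows shown above;
# the column names are taken from the sample's header line.
cols = ['Header','Year','Month','Day','Hour','Minute','Second',
        'Latitude','Longitude','Depth','Magnitude','Magnitude Type']
sample = pd.read_csv('sample_catalog.txt', sep = r'\s+', names = cols, skiprows = 1)
sample['Datetime'] = pd.to_datetime(sample[['Year','Month','Day','Hour','Minute','Second']])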
Kyle Macy
  • Why are you iterating thru the list. Can't you mark the mainshock based on a condition – Joe Ferndz Feb 19 '21 at 23:08
  • can we do step 1: identify what values below a certain value can be ignored. Remove that from the main dataframe. Do not SORT the data yet. Then look at all scores that can potentially be major earthquakes. For the rest of them, it will be comparison to the major earthquakes. If you do that, it will reduce the cycle time and also eliminate scanning full dataframe – Joe Ferndz Feb 19 '21 at 23:16
  • Also can you share your source data or share sample data. I would like to look at this and provide some recommendations to improve performance – Joe Ferndz Feb 19 '21 at 23:23
  • Hey Joe, I've added a table of sample data. – Kyle Macy Feb 20 '21 at 20:46
  • Excellent. Step 1: We can convert these columns into a single column `'Year','Month','Day','Hour','Minute','Second'` into `eq_timestamp` using datetime function. Then Step 2: Let's define what constitutes a `MainShock`. Then I assume the `Aftershock` will be something that happens immediately following the `MainShock` until we have the next `MainShock` ? What I am trying to do is the reduce the dataset that needs to be processed. – Joe Ferndz Feb 20 '21 at 21:01

1 Answer


Pandas is very slow when iterating. An easy way to speed it up is to use pandas selectors in your inner loop. For example, instead of doing:

for after,A in df3.iterrows():
    d_m = magnitude_differential(M['Magnitude'],A['Magnitude'])
    if d_m > 0:
        ...
    else: continue

You can do:

d_m_larger_than_zero = df3[magnitude_differential(M['Magnitude'], df3['Magnitude']) > 0]

The DataFrame d_m_larger_than_zero will only contain the rows with a positive magnitude differential. This is in general several orders of magnitude faster than iterating over the frame. You can even use it before the main loop. Instead of:

for main,M in df3.iterrows():
   if M['Magnitude'] >= minimum_mainshock_magnitude:
        ...

Do:

mainshock_events = df3[df3['Magnitude'] >= minimum_mainshock_magnitude]
for main,M in mainshock_events.iterrows():  # Only mainshocks here
    ...
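
Putting the same idea together for the rest of the inner loop, here is a rough, untested sketch. geopy's great_circle only takes one pair of points at a time, so I've swapped it for a vectorized haversine helper; treat that helper as an approximation of your current distance call, and note that this sketch does not reproduce the df3.drop step that removes already-claimed aftershocks:

import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2):
    # Vectorized haversine distance in km (mean Earth radius of 6371 km);
    # a stand-in for geopy's great_circle so whole columns can be tested at once.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

for main, M in mainshock_events.iterrows():
    srl = subsurface_rupture_length(M['Magnitude']) * srl_multiplier
    days = (df3['Datetime'] - M['Datetime']).dt.days
    candidates = df3[(M['Magnitude'] - df3['Magnitude'] > 0)
                     & (days > 0)
                     & (days <= minimum_days)]
    distance = great_circle_km(M['Latitude'], M['Longitude'],
                               candidates['Latitude'].to_numpy(),
                               candidates['Longitude'].to_numpy())
    aftershocks = candidates[distance <= srl]
    # aftershocks now holds every event passing all three tests for this mainshock:
    # len(aftershocks) replaces the counter `a`, and aftershocks.iloc[0] gives the
    # largest aftershock, since df3 is already sorted by magnitude in descending order.

Because each pass only runs whole-column pandas/NumPy operations, the per-mainshock cost should drop dramatically compared with iterrows. If you still need each aftershock to be claimed by only one mainshock, keep a set of already-used indices and exclude them from df3 before building candidates.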
nonDucor