
I have a data frame of more than 1,000,000 earthquakes. I go down the list of earthquakes, declaring each one in turn as the main shock. Then I run through the DataFrame again, trying to identify any earthquake that might be considered an aftershock based on a set of parameters calculated in the for loop, including magnitude, distance from the main shock, and time after the main shock.

Right now the code works fine, but it's been about 4 days and it still isn't done. Each iteration takes about 3 minutes, the larger earthquakes take about 10-15 minutes, and the largest earthquake took about 6 hours (mainly because so many earthquakes were declared aftershocks).

So here I have a function that calculates the rupture length of an earthquake (aftershocks have to be within a radius equal to the rupture length of the main shock); see subsurface_rupture_length:

def subsurface_rupture_length(magnitude):
    srl = 10**((magnitude - 4.38)/1.49)
    return(srl)
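
For what it's worth, since this is just arithmetic, the same formula also works element-wise on a whole pandas Series. A minimal sketch with made-up magnitudes (not from my catalog):

import pandas as pd

# Made-up example magnitudes; the formula broadcasts over the whole Series at once.
magnitudes = pd.Series([5.0, 6.5, 7.2])
srl = 10**((magnitudes - 4.38)/1.49)
print(srl)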

Then a function to calculate the difference in days between the main shock and an aftershock (see days_calc):

def days_calc(aftershock_time,mainshock_time):
    days = (aftershock_time - mainshock_time).days
    return(days)
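
The same difference can also be computed for many events at once when the times sit in a datetime column; a minimal sketch with made-up timestamps:

import pandas as pd

# Made-up event times; subtracting a single Timestamp from a datetime64 Series
# gives a timedelta64 Series, and .dt.days extracts the whole-day difference per row.
event_times = pd.to_datetime(pd.Series(['1983-01-02 03:00', '1983-01-10 12:00']))
mainshock_time = pd.Timestamp('1983-01-01 00:36:58')
days = (event_times - mainshock_time).dt.days
print(days)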

And then a function for the magnitude difference between the main shock and the aftershock (see magnitude_differential); it's important to my research, so this does need to stay:

def magnitude_differential(mainshock_magnitude,aftershock_magnitude):
    mag_diff = mainshock_magnitude - aftershock_magnitude
    return(mag_diff)

Then here's the main bulk of the code. It's just a for loop within a for loop, and it's kind of a mess, but I can't think of anything better at this point. It takes a series of input parameters:

def aftershock_search(df,minimum_depth,maximum_depth,minimum_mainshock_magnitude,minimum_aftershock_magnitude,srl_multiplier,minimum_days):
    df1 = pd.read_csv(df,sep = ',')
    df2 = df1.sort_values(by = ['Magnitude'],ascending = False)
    df3 = df2.loc[(df2['Depth'] >= minimum_depth) & (df2['Depth'] < maximum_depth) & (df2['Magnitude'] >= minimum_aftershock_magnitude)]
    df3['Datetime'] = pd.to_datetime(df3[['Year','Month','Day','Hour','Minute','Second']])
    df3.reset_index(drop = True,inplace =  True)
    M_magnitude,M_datetime,M_latitude,M_longitude,M_depth,A_magnitude,A_datetime,A_latitude,A_longitude,A_depth,A_count,delta_m = [],[],[],[],[],[],[],[],[],[],[],[]
    for main,M in df3.iterrows():
        if M['Magnitude'] >= minimum_mainshock_magnitude:
            A_mag,A_dat,A_lat,A_lon,A_dep,mag_diff = [],[],[],[],[],[]
            srl = subsurface_rupture_length(M['Magnitude']) * srl_multiplier
            a = 0
            for after,A in df3.iterrows():
                d_m = magnitude_differential(M['Magnitude'],A['Magnitude'])
                if d_m > 0:
                    days = days_calc(A['Datetime'],M['Datetime'])
                    if days > 0 and days <= minimum_days:
                        distance = great_circle((M['Latitude'],M['Longitude']),(A['Latitude'],A['Longitude'])).km
                        if distance <= srl:
                            a+=1
                            A_mag.append(A['Magnitude'])
                            A_dat.append(A['Datetime'])
                            A_lat.append(A['Latitude'])
                            A_lon.append(A['Longitude'])
                            A_dep.append(A['Depth'])
                            mag_diff.append(d_m)
                            df3.drop(index = after,inplace = True)
                        else: continue
                    else: continue
                else: continue
            M_magnitude.append(M['Magnitude'])
            M_datetime.append(M['Datetime'])
            M_latitude.append(M['Latitude'])
            M_longitude.append(M['Longitude'])
            M_depth.append(M['Depth'])
            A_count.append(a)
            if a > 0:
                A_magnitude.append(A_mag[0])
                A_datetime.append(A_dat[0])
                A_latitude.append(A_lat[0])
                A_longitude.append(A_lon[0])
                A_depth.append(A_dep[0])
                delta_m.append(mag_diff[0])
            elif a == 0:
                A_magnitude.append('N/A')
                A_datetime.append('N/A')
                A_latitude.append('N/A')
                A_longitude.append('N/A')
                A_depth.append('N/A')
                delta_m.append('N/A')
        else: break
    
    Aftershocks = pd.DataFrame()
    Aftershocks['Mainshock Magnitude'] = M_magnitude
    Aftershocks['Mainshock Datetime'] = M_datetime
    Aftershocks['Mainshock Latitude'] = M_latitude
    Aftershocks['Mainshock Longitude'] = M_longitude
    Aftershocks['Mainshock Depth'] = M_depth
    Aftershocks['Aftershock Count'] = A_count
    Aftershocks['Largest Aftershock Magnitude'] = A_magnitude
    Aftershocks['Largest Aftershock Datetime'] = A_datetime
    Aftershocks['Largest Aftershock Latitude'] = A_latitude
    Aftershocks['Largest Aftershock Longitude'] = A_longitude
    Aftershocks['Largest Aftershock Depth'] = A_depth
    Aftershocks['Delta M'] = delta_m
    return(Aftershocks)

I just plug the whole catalog into this function (the df argument is actually the path to the CSV, which gets read inside).

At the end I bundle everything up into a DataFrame and save it as a CSV later.
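
For context, this is roughly how I call it and save the result; the threshold values below are placeholders, not my actual settings:

Aftershocks = aftershock_search('earthquake_catalog.csv',  # placeholder path to the full catalog
                                minimum_depth = 0,
                                maximum_depth = 70,
                                minimum_mainshock_magnitude = 6.0,
                                minimum_aftershock_magnitude = 2.5,
                                srl_multiplier = 1.0,
                                minimum_days = 365)
Aftershocks.to_csv('aftershock_results.csv', index = False)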

I would love some help trying to speed this up!

Thanks!

Update: I've added a sample of the earthquake catalog that I am using:

Header Year Month Day Hour Minute Second Latitude Longitude Depth Magnitude Magnitude Type
J 1983 1 1 0 36 58.4 33.778333333333336 139.35333333333332 21.0 3.9 V
J 1983 1 1 0 44 37.3 33.891666666666666 139.41833333333332 0.0 3.2 V
J 1983 1 1 1 5 51.6 33.91833333333334 139.43666666666667 17.0 3.7 V
J 1983 1 1 1 18 54.0 33.78 139.41 16.0 2.8 V
J 1983 1 1 1 21 54.4 33.79833333333333 139.335 9.0 2.9 V
J 1983 1 1 1 26 26.3 33.77333333333333 139.41833333333332 20.0 3.4 V
J 1983 1 1 1 30 26.6 33.8 139.34 0.0 0.0
J 1983 1 1 1 36 26.6 33.778333333333336 139.43666666666667 30.0 3.0 V
J 1983 1 1 1 47 47.5 33.785 139.33333333333334 11.0 2.6 V
J 1983 1 1 1 51 48.3 33.763333333333335 139.39 15.0 3.0 V
J 1983 1 1 1 53 13.7 33.778333333333336 139.37666666666667 19.0 3.1 V
J 1983 1 1 1 54 3.1 33.79 139.375 19.0 4.1 J
J 1983 1 1 2 18 30.9 33.76166666666667 139.34666666666666 17.0 3.7 V
J 1983 1 1 2 19 46.9 33.788333333333334 139.37333333333333 26.0 3.3 V
J 1983 1 1 2 26 8.0 33.846666666666664 139.51833333333335 41.0 2.5 V
J 1983 1 1 2 55 7.2 36.41833333333334 141.12833333333333 44.0 2.8 V
J 1983 1 1 2 55 24.0 33.79833333333333 139.315 14.0 2.5 V
J 1983 1 1 2 56 50.4 33.803333333333335 139.385 23.0 2.7 V
J 1983 1 1 3 3 3.8 33.638333333333335 139.66333333333333 25.0 4.0 V
J 1983 1 1 3 15 4.6 33.79333333333334 139.39166666666668 29.0 2.5 V
J 1983 1 1 3 17 14.3 33.803333333333335 139.30833333333334 0.0 3.2 V
J 1983 1 1 3 19 7.7 33.751666666666665 139.315 4.0 2.9 V
J 1983 1 1 3 31 17.2 33.77166666666667 139.32 4.0 2.9 V
J 1983 1 1 3 41 1.1 33.78333333333333 139.31666666666666 0.0 2.6 V
J 1983 1 1 3 49 15.7 33.785 139.33333333333334 11.0 2.5 V
J 1983 1 1 3 50 34.8 33.843333333333334 139.48 35.0 2.8 V
J 1983 1 1 3 52 32.8 33.751666666666665 139.26833333333335 0.0 3.8 V
J 1983 1 1 3 56 9.8 33.846666666666664 139.485 41.0 3.0 V
J 1983 1 1 4 2 22.3 33.83833333333333 139.46833333333333 38.0 3.1 V
J 1983 1 1 4 15 40.7 36.29666666666667 140.95666666666668 45.0 2.2 V
J 1983 1 1 4 27 29.9 33.83833333333333 139.455 36.0 3.0 V
J 1983 1 1 5 23 44.5 33.82333333333333 139.41 29.0 2.8 V
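
In case it helps to reproduce this, here is a rough sketch for reading the whitespace-delimited sample above into pandas; my real catalog is a comma-separated file, so the file name, separator, and column handling here apply to the sample only:

import pandas as pd

# 'sample_catalog.txt' is a placeholder file holding the rows shown above;
# the column names are taken from the sample's header line.
cols = ['Header','Year','Month','Day','Hour','Minute','Second',
        'Latitude','Longitude','Depth','Magnitude','Magnitude Type']
sample = pd.read_csv('sample_catalog.txt', sep = r'\s+', names = cols, skiprows = 1)
sample['Datetime'] = pd.to_datetime(sample[['Year','Month','Day','Hour','Minute','Second']])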
Kyle Macy
  • Why are you iterating thru the list. Can't you mark the mainshock based on a condition – Joe Ferndz Feb 19 '21 at 23:08
  • can we do step 1: identify what values below a certain value can be ignored. Remove that from the main dataframe. Do not SORT the data yet. Then look at all scores that can potentially be major earthquakes. For the rest of them, it will be comparison to the major earthquakes. If you do that, it will reduce the cycle time and also eliminate scanning full dataframe – Joe Ferndz Feb 19 '21 at 23:16
  • Also can you share your source data or share sample data. I would like to look at this and provide some recommendations to improve performance – Joe Ferndz Feb 19 '21 at 23:23
  • Hey Joe, I've added a table of sample data. – Kyle Macy Feb 20 '21 at 20:46
  • Excellent. Step 1: We can convert these columns into a single column `'Year','Month','Day','Hour','Minute','Second'` into `eq_timestamp` using datetime function. Then Step 2: Let's define what constitutes a `MainShock`. Then I assume the `Aftershock` will be something that happens immediately following the `MainShock` until we have the next `MainShock` ? What I am trying to do is the reduce the dataset that needs to be processed. – Joe Ferndz Feb 20 '21 at 21:01

1 Answer


Pandas is very slow when iterating. An easy way to speed it up is to use pandas selectors in your inner loop. For example, instead of doing:

for after,A in df3.iterrows():
    d_m = magnitude_differential(M['Magnitude'],A['Magnitude'])
    if d_m > 0:
        ...
    else: continue

You can do:

d_m_larger_than_zero = df3[magnitude_differential(M['Magnitude'], df3['Magnitude']) > 0]

The DataFrame d_m_larger_than_zero will only contain the rows with a positive magnitude differential. This is in general several orders of magnitude faster than iterating over the frame. You can even use it before the main loop. Instead of:

for main,M in df3.iterrows():
   if M['Magnitude'] >= minimum_mainshock_magnitude:
        ...

Do:

mainshock_events = df3[df3['Magnitude'] >= minimum_mainshock_magnitude]
for main,M in mainshock_events.iterrows():  # Only mainshocks here
    ...
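
Putting the same idea together for the rest of the inner loop, here is a rough, untested sketch. geopy's great_circle only takes one pair of points at a time, so I've swapped it for a vectorized haversine helper; treat that helper as an approximation of your current distance call, and note that this sketch does not reproduce the df3.drop step that removes already-claimed aftershocks:

import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2):
    # Vectorized haversine distance in km (mean Earth radius of 6371 km);
    # a stand-in for geopy's great_circle so whole columns can be tested at once.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

for main, M in mainshock_events.iterrows():
    srl = subsurface_rupture_length(M['Magnitude']) * srl_multiplier
    days = (df3['Datetime'] - M['Datetime']).dt.days
    candidates = df3[(M['Magnitude'] - df3['Magnitude'] > 0)
                     & (days > 0)
                     & (days <= minimum_days)]
    distance = great_circle_km(M['Latitude'], M['Longitude'],
                               candidates['Latitude'].to_numpy(),
                               candidates['Longitude'].to_numpy())
    aftershocks = candidates[distance <= srl]
    # aftershocks now holds every event passing all three tests for this mainshock:
    # len(aftershocks) replaces the counter `a`, and aftershocks.iloc[0] gives the
    # largest aftershock, since df3 is already sorted by magnitude in descending order.

Because each pass only runs whole-column pandas/NumPy operations, the per-mainshock cost should drop dramatically compared with iterrows. If you still need each aftershock to be claimed by only one mainshock, keep a set of already-used indices and exclude them from df3 before building candidates.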
nonDucor