Python: simplifying code by writing it in a more Pandas specific way

Question

I wrote some code that finds the distance between consecutive GPS coordinates for machines with the same serial number, based on the post

Fast Haversine Approximation (Python/Pandas)

I believe it would be more efficient if it could be simplified to use iterrows or df.apply; however, I cannot seem to figure it out.

Since I only need to execute the function when ser_no[i] == ser_no[i+1], and need to insert a NaN value at the location where the ser_no changes, I cannot seem to apply the Pandas methodology to make the code more efficient. I have looked at related posts, but unfortunately I don't readily see the leap I need to make even after looking them over.

What I have:

import numpy as np

def haversine(lat1, long1, lat2, long2):
    r = 6371  # radius of Earth in km
    # convert decimal degrees to radians
    lat1, long1, lat2, long2 = map(np.radians, [lat1, long1, lat2, long2])
    # haversine formula
    lat = lat2 - lat1
    lon = long2 - long1
    a = np.sin(lat/2)**2 + np.cos(lat1)*np.cos(lat2)*np.sin(lon/2)**2
    c = 2*np.arcsin(np.sqrt(a))
    d = r*c
    return d
# pre-allocate vector
hdist = np.zeros(len(mttt_pings.index), dtype=float)
# haversine loop calculation
for i in range(0, len(mttt_pings.index) - 1):
    # when the ser_no from i and i + 1 are the same, calculate the distance
    # between them using the haversine formula and put the distance in the
    # i + 1 location
    if mttt_pings.ser_no[i] == mttt_pings.ser_no[i + 1]:
        hdist[i + 1] = haversine(mttt_pings.EQP_GPS_SPEC_LAT_CORD[i],
                                 mttt_pings.EQP_GPS_SPEC_LONG_CORD[i],
                                 mttt_pings.EQP_GPS_SPEC_LAT_CORD[i + 1],
                                 mttt_pings.EQP_GPS_SPEC_LONG_CORD[i + 1])
    # when ser_no i and i + 1 are not the same, put NaN in the i + 1 location
    # (np.insert would lengthen hdist and shift the entries after it)
    else:
        hdist[i + 1] = np.nan

Could you post a sample of your data? – AGS, Mar 24 '16 at 18:35

root · Accepted Answer · 2016-03-25T16:54:42.593

The main idea is to use shift to compare each row with the one before it. I'm also writing a get_dist function that just wraps your existing distance function, to make things more readable when I use apply to compute distances.

def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return haversine(lat1, long1, lat2, long2)

# Find consecutive rows with matching ser_no, and get coordinates.
coord_cols = ['EQP_GPS_SPEC_LAT_CORD', 'EQP_GPS_SPEC_LONG_CORD']
matching_ser = mttt_pings['ser_no'] == mttt_pings['ser_no'].shift(1)
shift_coords = mttt_pings.shift(1).loc[matching_ser, coord_cols]

# Join shifted coordinates and compute distances.
mttt_pings_shift = mttt_pings.join(shift_coords, how='inner', rsuffix='_2')
mttt_pings['hdist'] = mttt_pings_shift.apply(get_dist, axis=1)

In the above code, I've added the distances to your dataframe. If you want to get the result as a numpy array, you can do:

hdist = mttt_pings['hdist'].values
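
Because haversine is built from NumPy functions, it should also accept whole columns, so the row-wise apply can probably be skipped. Here is a minimal sketch of that variant. It assumes a small made-up frame in place of your real mttt_pings (the sample values are hypothetical) and restates haversine so the snippet runs on its own:

```python
import numpy as np
import pandas as pd


def haversine(lat1, long1, lat2, long2):
    """Great-circle distance in km; takes scalars or array-likes in degrees."""
    r = 6371  # radius of Earth in km
    lat1, long1, lat2, long2 = map(np.radians, [lat1, long1, lat2, long2])
    dlat = lat2 - lat1
    dlon = long2 - long1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r * np.arcsin(np.sqrt(a))


# Hypothetical sample data standing in for mttt_pings.
mttt_pings = pd.DataFrame({
    'ser_no': ['A', 'A', 'A', 'B', 'B'],
    'EQP_GPS_SPEC_LAT_CORD': [0.0, 0.0, 1.0, 10.0, 10.0],
    'EQP_GPS_SPEC_LONG_CORD': [0.0, 1.0, 1.0, 20.0, 20.0],
})

# Previous row's values, aligned on the same index as the current row.
prev = mttt_pings.shift(1)
same_ser = mttt_pings['ser_no'] == prev['ser_no']

# Compute every consecutive distance at once, then keep only the rows
# whose ser_no matches the previous row; the rest become NaN.
dist = haversine(prev['EQP_GPS_SPEC_LAT_CORD'], prev['EQP_GPS_SPEC_LONG_CORD'],
                 mttt_pings['EQP_GPS_SPEC_LAT_CORD'], mttt_pings['EQP_GPS_SPEC_LONG_CORD'])
mttt_pings['hdist'] = dist.where(same_ser)
```

As with the join-based version, the first row of each serial number ends up as NaN. On a large frame this is likely to be faster than apply with axis=1, though I have not timed it on your data.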

As a side note, you may want to consider using geopy to compute distances between lat/long coordinates. Its geopy.distance.vincenty function is generally more accurate than haversine, although it may take longer to compute. Note that vincenty was removed in geopy 2.0; geopy.distance.geodesic replaces it and is called the same way. Very minor modifications to the get_dist function are required:

from geopy.distance import geodesic  # on older geopy releases, import vincenty instead

def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return geodesic((lat1, long1), (lat2, long2)).km
