How to select rows in Python Dataframe with a condition based on a function using a column

Question

I have a dataframe df that looks like that :

id          id_latlong
1          (46.1988400;5.209562)
2          (46.1988400;5.209562)
3          (46.1988400;5.209562)
4          (46.1988400;5.209562)
5         (46.438805;5.11890299)
6         (46.222993;5.21707600)
7           (46.195183;5.212575)
8           (46.195183;5.212575)
9           (46.195183;5.212575)
10          (48.917459;2.570821)
11          (48.917459;2.570821)

Every row is a location and the data in the column "id_latlong" are coordinates.

I want to select the id of every location that is at less than 800 meters from a defined location :

defined_location_latlong = "(46.1988400;5.209562)"

I have a function that calcule the distance, in meters, between two coordinates:

def distance_btw_coordinates (id_latlong1, id_latlong2) :
    try : 
        R = 6372800  # Earth radius in meters

        lat1 = float(id_latlong1.partition('(')[2].partition(';')[0])
        lon1 = float(id_latlong1.partition(';')[2].partition(')')[0])

        lat2 = float(id_latlong2.partition('(')[2].partition(';')[0])
        lon2 = float(id_latlong2.partition(';')[2].partition(')')[0])

        phi1, phi2 = math.radians(lat1), math.radians(lat2) 
        dphi       = math.radians(lat2 - lat1)
        dlambda    = math.radians(lon2 - lon1)

        a = math.sin(dphi/2)**2 + \
            math.cos(phi1)*math.cos(phi2)*math.sin(dlambda/2)**2

        distance = 2*R*math.atan2(math.sqrt(a), math.sqrt(1 - a))
    except :
        distance = 1000000000

    return distance

In order to select every row that is at less than 800 meters from the defined location, I tried :

df.loc[distance_btw_cohordonates(df['id_latlong'], defined_location_latlong ) < 800]

But it doesn't work :

KeyError: False

It doesn't work because the function takes all the data in the column "id_latlong" at once...

Do you know how I could do this without having to iterate ?

Thank you !

EDIT : I have 500k different defined locations, I would prefer not having to stock the distance between every row in df and every defined location... Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

jezrael · Answer 1 · 2020-05-02T05:17:36.577

I think you need processing function for each value of column separately by Series.apply:

s = df['id_latlong'].apply(lambda x: distance_btw_coordinates(x, defined_location_latlong))
print (s)
0     1000000000
1     1000000000
2     1000000000
3     1000000000
4     1000000000
5     1000000000
6     1000000000
7     1000000000
8     1000000000
9     1000000000
10    1000000000
Name: id_latlong, dtype: int64

df.loc[s < 800]

EDIT:

Is it possible to select every location that is at less than 800 meters without having to stock the distances ?

One idea is use vectorizes function haversine_np, but is necessary change your code for parsing strings to columns and also to numeric:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
print (df)
    id              id_latlong        lat      long
0    1   (46.1988400;5.209562)  46.198840  5.209562
1    2   (46.1988400;5.209562)  46.198840  5.209562
2    3   (46.1988400;5.209562)  46.198840  5.209562
3    4   (46.1988400;5.209562)  46.198840  5.209562
4    5  (46.438805;5.11890299)  46.438805  5.118903
5    6  (46.222993;5.21707600)  46.222993  5.217076
6    7    (46.195183;5.212575)  46.195183  5.212575
7    8    (46.195183;5.212575)  46.195183  5.212575
8    9    (46.195183;5.212575)  46.195183  5.212575
9   10    (48.917459;2.570821)  48.917459  2.570821
10  11    (48.917459;2.570821)  48.917459  2.570821

lat, long = tuple(map(float, defined_location_latlong.strip('()').split(';')))
print (lat, long)
46.19884 5.209562

s = haversine_np(df['long'], df['lat'], lat, long)
print (s)
0     6016.063040
1     6016.063040
2     6016.063040
3     6016.063040
4     6037.462224
5     6017.186477
6     6015.635700
7     6015.635700
8     6015.635700
9     6353.080382
10    6353.080382
dtype: float64

#km output
df.loc[s < 0.8]

EDIT1:

For improve performance of spliting is possible use:

#550000 rows for test
df = pd.concat([df] * 50000, ignore_index=True)

df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)

print (df)

In [38]: %timeit df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float)
2.49 s ± 722 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [39]: %timeit df[['lat1','long1']] = pd.DataFrame([x.strip('()').split(';') for x in df['id_latlong']], index=df.index).astype(float)
937 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thank you. Do you know if it's possible in one step, without having to stock every distance ? — SimbaIsLearning, May 02 '20 at 04:50
It calculates very quickly the distances, but I still have the intermediate step with the saved distances. The dataframe has 700k rows and there is 500k defined locations, that would mean saving 350 billions distances. — SimbaIsLearning, May 02 '20 at 05:10
@SimbaIsLearning hmm, so need faster `df[['lat','long']] = df['id_latlong'].str.strip('()').str.split(';', expand=True).astype(float) ` ? — jezrael, May 02 '20 at 05:11
Can you explain the results of: s = haversine_np(df['long'], df['lat'], lat, long)? It seems like there should be more variation in the distances. — Matthew Borish, May 02 '20 at 08:21
@MatthewBorish - Agree, I use simple vectorized `haversine`, but if OP need it not idea. — jezrael, May 02 '20 at 08:32
I modified my code and try to make it work in different ways. I always have the error message : `RuntimeWarning: invalid value encountered in arcsin c = 2 * np.arcsin(np.sqrt(a)) ` that occurs in the haversine_np function. Do you know where it might come from ? — SimbaIsLearning, May 04 '20 at 22:06

Matthew Borish · Answer 2 · 2020-05-02T08:25:35.710

pd.set_option('display.float_format', lambda x: '%.6f' % x)    
from scipy.spatial import KDTree
import pandas as pd
df = pd.read_clipboard()
print(df)

    id  id_latlong
0   1   (46.1988400;5.209562)
1   2   (46.1988400;5.209562)
2   3   (46.1988400;5.209562)
3   4   (46.1988400;5.209562)
4   5   (46.438805;5.11890299)
5   6   (46.222993;5.21707600)
6   7   (46.195183;5.212575)
7   8   (46.195183;5.212575)
8   9   (46.195183;5.212575)
9   10  (48.917459;2.570821)
10  11  (48.917459;2.570821)

Condition the df.

df = df['id_latlong'].str.split(";", expand=True)
df['lat'] = df[0].str.replace('(', '')
df['lon'] = df[1].str.replace(')', '')
df['lat'] = pd.to_numeric(df['lat'])
df['lon'] = pd.to_numeric(df['lon'])
print(df.head(3))

      0          1           lat         lon
0   (46.1988400 5.209562)   46.19884    5.209562
1   (46.1988400 5.209562)   46.19884    5.209562
2   (46.1988400 5.209562)   46.19884    5.209562

Convert to UTM 31 N so we can have meters distance instead of lat/long.

dl_df = pd.DataFrame({'lat':[46.1988400], 'lon':5.209562})
dl_gdf = gpd.GeoDataFrame(dl_df, geometry=gpd.points_from_xy(dl_df.lon, dl_df.lat))

dl_gdf.crs = 4326
dl_gdf = dl_gdf.to_crs(32631)

dl_gdf['E'] = dl_gdf['geometry'].x
dl_gdf['N'] = dl_gdf['geometry'].y

print(dl_gdf)
    lat lon geometry    E   N
0   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524

gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.lon, df.lat))

gdf.crs = 4326
gdf = gdf.to_crs(32631)

gdf['E'] = gdf['geometry'].x
gdf['N'] = gdf['geometry'].y

print(gdf)

0   1   lat lon geometry    E   N
0   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
1   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
2   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
3   (46.1988400 5.209562)   46.198840   5.209562    POINT (670475.888 5118513.417)  670475.888071   5118513.416524
4   (46.438805  5.11890299) 46.438805   5.118903    POINT (662767.928 5144985.070)  662767.928322   5144985.069816
5   (46.222993  5.21707600) 46.222993   5.217076    POINT (670980.609 5121213.169)  670980.608959   5121213.168557
6   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
7   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
8   (46.195183  5.212575)   46.195183   5.212575    POINT (670719.678 5118113.575)  670719.677504   5118113.574785
9   (48.917459  2.570821)   48.917459   2.570821    POINT (468556.965 5418368.922)  468556.964829   5418368.922484
10  (48.917459  2.570821)   48.917459   2.570821    POINT (468556.965 5418368.922)  468556.964829   5418368.922484

Find the distances in meters with KDTree. If there was more than one row in dl_gdf the indices in new_df would be the closest point.

join_cols = ['E', 'N']
tree = KDTree(dl_gdf[join_cols])
distance, indices = tree.query(gdf[join_cols])

new_df = pd.DataFrame({'distance':distance, 'indices': indices})

print(new_df)

    distance    indices
0   0.000000    0
1   0.000000    0
2   0.000000    0
3   0.000000    0
4   27571.018688    0
5   2746.525845 0
6   468.301937  0
7   468.301937  0
8   468.301937  0
9   361503.217161   0
10  361503.217161   0

Get the rows with points < 800m.

less_than_800m_df = new_df.loc[new_df['distance'] < 800]
print(less_than_800m_df)

    distance    indices
0   0.000000    0
1   0.000000    0
2   0.000000    0
3   0.000000    0
6   468.301937  0
7   468.301937  0
8   468.301937  0

Here is a validation image from QGIS. The precision isn't great with the manual measurement, but the results in the new_df look correct.

Here is a closeup for new_df idxs 6,7, and 8.

How to select rows in Python Dataframe with a condition based on a function using a column

2 Answers2