
I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.

I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.

For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.

So far I have the following:

ID_list = []

for index, row in A.iterrows():
    filtered_B = B.apply(lambda x: x['ID'] if row['latitude'] >= x['min_latitude']
                                           and row['latitude'] < x['max_latitude']
                                           and row['longitude'] >= x['min_longitude']
                                           and row['longitude'] < x['max_longitude']
                                           else None, axis=1)
    ID_list.append(B.loc[filtered_B.notna(), 'ID'])

The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.

The above code technically works, but it processes only about 1,000 rows per minute, which is not feasible for such a large dataset.

Any tips would be appreciated, thank you.

edit: sample dataframes:

A:

location    latitude     longitude
1           -33.81263    151.23691
2           -33.994823   151.161274
3           -33.320154   151.662009
4           -33.99019    151.1567332

B:

ID        min_latitude  max_latitude  min_longitude  max_longitude
9ae8704   -33.815       -33.810       151.234        151.237
2ju1423   -33.555       -33.543       151.948        151.957
3ef4522   -33.321       -33.320       151.655        151.668
0uh0478   -33.996       -33.990       151.152        151.182

expected output:

ID_list = ['9ae8704', '0uh0478', '3ef4522', '0uh0478']
  • kindly share sample dataframes, with the expected output. – sammywemmy Oct 11 '21 at 06:48
  • You did not add the expected output dataframe – sammywemmy Oct 11 '21 at 07:55
  • Will the values in A definitely be found in B? If not, what should be the output for that row in A? And could there be multiple matches? – EBDS Oct 13 '21 at 15:01
  • If not found, fill with NaN. And there will only be one match. – ruby Oct 14 '21 at 11:31
  • Try np.vectorize or the other alternatives mentioned here instead of `apply` and see if it is still slow. https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c – Emma Oct 14 '21 at 18:37
  • Do you need a Python-only solution? If you don't, I would recommend MongoDB: use table B to create documents with geo shapes, and query them with the points from A through the MongoDB Python library `pymongo`. A local MongoDB will do, you don't have that much data. Hope it helps; it is much better suited to your use case than geopandas or any other Python library. – Clem G. Oct 19 '21 at 14:57

6 Answers

3

I would use geopandas to do this; its spatial join makes use of an R-tree spatial index.

import geopandas as gpd
from shapely.geometry import box

a_gdf = gpd.GeoDataFrame(a[['location']], geometry=gpd.points_from_xy(a.longitude,
                                                                      a.latitude))
b_gdf = gpd.GeoDataFrame(
    b[['ID']], 
    geometry=[box(*bounds) for _, bounds in b.loc[:, ['min_longitude',
                                                      'min_latitude', 
                                                      'max_longitude', 
                                                      'max_latitude']].iterrows()])

gpd.sjoin(a_gdf, b_gdf)

Output:

   location                                 geometry  index_right       ID
0         1              POINT (151.23691 -33.81263)            0  9ae8704
1         2            POINT (151.161274 -33.994823)            3  0uh0478
3         4   POINT (151.1567332 -33.99019000000001)            3  0uh0478
2         3            POINT (151.662009 -33.320154)            2  3ef4522
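If you need the IDs back on a itself, with NaN where a point falls in no box (the behaviour requested in the comments), a left spatial join keeps the unmatched points. A minimal sketch, assuming each point matches at most one box so the output has one row per point; note also that sjoin's default 'intersects' predicate treats box edges as inclusive, unlike the question's >= / < convention:

joined = gpd.sjoin(a_gdf, b_gdf, how='left')  # unmatched points get NaN in 'ID'
a['ID'] = joined['ID'].to_numpy()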
2

We can create a multi-interval index on b and then use regular .loc indexing into it with (latitude, longitude) tuples from the rows of a. Interval indexes are useful in situations like this, where we have a table of low and high values to bucket a variable into.

from io import StringIO

import pandas as pd

a = pd.read_table(StringIO("""
    location    latitude    longitude
    1   -33.81263   151.23691
    2   -33.994823  151.161274
    3   -33.320154  151.662009
    4   -33.99019   151.1567332
"""), sep='\s+')

b = pd.read_table(StringIO("""
    ID  min_latitude    max_latitude    min_longitude   max_longitude
    9ae8704     -33.815     -33.810     151.234     151.237
    2ju1423     -33.555     -33.543     151.948     151.957
    3ef4522     -33.321     -33.320     151.655     151.668
    0uh0478     -33.996     -33.990     151.152     151.182
"""), sep='\s+')

lat_index = pd.IntervalIndex.from_arrays(b['min_latitude'], b['max_latitude'], closed='left')
lon_index = pd.IntervalIndex.from_arrays(b['min_longitude'], b['max_longitude'], closed='left')
index = pd.MultiIndex.from_tuples(list(zip(lat_index, lon_index)), names=['lat_range', 'lon_range'])
b = b.set_index(index)

print(b.loc[list(zip(a.latitude, a.longitude)), 'ID'].tolist())

The above will even handle rows of a that have no corresponding row in b by gracefully filling in those values with NaN. (Newer pandas versions raise a KeyError when .loc is given list-like labels missing from the index, so this graceful fill depends on your pandas version.)
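If your pandas version does raise on missing labels, IntervalIndex.get_indexer is an alternative: it returns the position of the interval containing each value, or -1 where there is none. A minimal sketch (the lat_pos / lon_pos / hit names are mine, not from the answer), assuming additionally that the latitude intervals do not overlap one another, and likewise the longitude intervals:

import numpy as np

lat_pos = lat_index.get_indexer(a['latitude'])
lon_pos = lon_index.get_indexer(a['longitude'])
# A point only matches if the same row of b contains it on both axes.
hit = (lat_pos != -1) & (lat_pos == lon_pos)
a['ID'] = np.where(hit, b['ID'].to_numpy()[lat_pos], None)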

Kyle Parsons
  • Interval indexes are fast and the right approach; however, if the intervals are overlapping, your solution won't suffice – sammywemmy Oct 19 '21 at 06:01
  • Overlapping intervals do cause unexpected behavior, but the comments to the question clarify that the intervals will not overlap. – Kyle Parsons Oct 19 '21 at 13:59
1

A good option for this might be to perform a cross merge and then filter out the rows that don't qualify. For example, you might do:

AB_cross = A.merge(
    B,
    how="cross"
)

Now we have a giant dataframe pairing every row of A with every row of B; we don't yet know which pairs actually qualify. The merge itself is fast, but it creates a very large dataset in memory: 300,000 × 10,000 = 3 billion rows, so at full size you may need to process A in chunks.

Now, we need to apply our logic by filtering the dataset accordingly. The comparisons are vectorized through numpy, so they're very fast. Using between also makes the code a bit more readable.

Note that below I use .between(inclusive = 'left') to encode min_longitude <= longitude < max_longitude (the inequality is inclusive only on the left end of the interval), matching the question's >= / < conditions; the string form of inclusive requires pandas 1.3+.

ID_list = AB_cross['ID'].loc[
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive = 'left') &
    AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive = 'left')
]
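The above still drops rows of A that match no box, which should end up as NaN according to the comments. A minimal sketch of mapping the matches back onto A (assuming, per the comments, at most one match per point, and that location uniquely identifies rows of A):

mask = (
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive='left')
    & AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive='left')
)
# Left-merge so rows of A with no matching box get NaN in the ID column.
A_with_ID = A.merge(AB_cross.loc[mask, ['location', 'ID']], on='location', how='left')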
1

A reasonably fast approach could be to pre-sort the points by latitude and by longitude, then iterate over the boxes, finding the points inside each box by latitude (min_latitude <= lat < max_latitude) and by longitude (min_longitude <= lon < max_longitude) separately with np.searchsorted, and then intersecting the two index sets with np.intersect1d.

For 300K points and 10K non-overlapping boxes in my tests it took less than 10 seconds to run.

Here's an example implementation:

import numpy as np
import pandas as pd

# create `ids` series to be populated with box IDs for each point
ids = pd.Series(np.nan, index=a.index)

# create series with points sorted by lats and lons
lats = a['latitude'].sort_values()
lons = a['longitude'].sort_values()

# iterate over boxes
for bi, r in b.set_index('ID').iterrows():
    # find points inside the box by latitude:
    i1, i2 = np.searchsorted(lats, [r['min_latitude'], r['max_latitude']])
    ix_lat = lats.index[i1:i2]
    
    # find points inside the box by longitude:
    j1, j2 = np.searchsorted(lons, [r['min_longitude'], r['max_longitude']])
    ix_lon = lons.index[j1:j2]
    
    # find points inside the box as intersection and set values in ids:
    ix = np.intersect1d(ix_lat, ix_lon)
    ids.loc[ix] = bi

ids.tolist()

Output (on provided sample data):

['9ae8704', '0uh0478', '3ef4522', '0uh0478']
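To sanity-check the stated timing at full size, you can generate synthetic data of the same shape. A hypothetical generator (not the answerer's actual benchmark code): 10K disjoint unit boxes on a 100x100 grid and 300K random points:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ii, jj = np.meshgrid(np.arange(100), np.arange(100))
ii, jj = ii.ravel(), jj.ravel()
b = pd.DataFrame({
    'ID': ['box_%d_%d' % (i, j) for i, j in zip(ii, jj)],
    'min_latitude': ii, 'max_latitude': ii + 1,
    'min_longitude': jj, 'max_longitude': jj + 1,
})
a = pd.DataFrame({
    'latitude': rng.uniform(0, 100, 300_000),
    'longitude': rng.uniform(0, 100, 300_000),
})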
perl
0

Your code consumes: [timing screenshot]

and this code

import pandas as pd

A = pd.DataFrame.from_dict(
    {'location': {0: 1, 1: 2, 2: 3, 3: 4},
     'latitude': {0: -33.81263, 1: -33.994823, 2: -33.320154, 3: -33.99019},
     'longitude': {0: 151.23691, 1: 151.161274, 2: 151.662009, 3: 151.1567332}}
)
B = pd.DataFrame.from_dict(
    {'ID': {0: '9ae8704', 1: '2ju1423', 2: '3ef4522', 3: '0uh0478'},
     'min_latitude': {0: -33.815, 1: -33.555, 2: -33.321, 3: -33.996},
     'max_latitude': {0: -33.81, 1: -33.543, 2: -33.32, 3: -33.99},
     'min_longitude': {0: 151.234, 1: 151.948, 2: 151.655, 3: 151.152},
     'max_longitude': {0: 151.237, 1: 151.957, 2: 151.668, 3: 151.182}}
)


def func(latitude, longitude):
    # Return the ID of the first box containing the point, or None if there is none.
    for _, x in B.iterrows():
        if (x.min_latitude <= latitude < x.max_latitude
                and x.min_longitude <= longitude < x.max_longitude):
            return x['ID']


A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()

consumes: [timing screenshot]

so the new solution is about 2.33 times faster.
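A comment on the question suggested np.vectorize as an alternative to apply. For completeness, a sketch of that variant (np.vectorize is a convenience wrapper around a Python loop, so don't expect a large speedup; timing not verified here):

import numpy as np

# otypes=[object] lets func return string IDs or None.
vfunc = np.vectorize(func, otypes=[object])
A['ID'] = vfunc(A['latitude'].to_numpy(), A['longitude'].to_numpy())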

Amir saleem
0

I think Geopandas would be the best solution for making sure all of your edge cases are covered, like meridian/equator crossovers and the like; spatial queries are exactly what Geopandas is designed for. It can be a pain to install, though.

One naive approach in numpy (assuming the coordinates are well-behaved, i.e. no boxes wrapping across the antimeridian) would be to calculate each of your four clauses as the sign of a difference, broadcast over every point/box pair, and then keep only the pairs where all the signs match your criteria.

For lots of intensive, repetitive calculations like this, it's usually better to dump the data out of pandas into numpy, process it there, and then put it back into pandas. Note that the intermediate matrices below are len(A) × len(B), so at 300,000 × 10,000 you would likely need to process A in chunks.

import numpy as np

a_lats = A.latitude.values.reshape(-1, 1)
b_min_lats = B.min_latitude.values.reshape(1, -1)
b_max_lats = B.max_latitude.values.reshape(1, -1)

a_lons = A.longitude.values.reshape(-1, 1)
b_min_lons = B.min_longitude.values.reshape(1, -1)
b_max_lons = B.max_longitude.values.reshape(1, -1)

north_of_min_lat = np.sign(a_lats - b_min_lats)   # +1 where lat > min_latitude
south_of_max_lat = np.sign(b_max_lats - a_lats)   # +1 where lat < max_latitude

east_of_min_lon = np.sign(a_lons - b_min_lons)    # +1 where lon > min_longitude
west_of_max_lon = np.sign(b_max_lons - a_lons)    # +1 where lon < max_longitude

# All four signs are +1 only for points strictly inside a box (np.sign is 0
# on a boundary, so points exactly on a min edge won't match, unlike the
# question's >= condition).
margin_matches = (north_of_min_lat + south_of_max_lat + east_of_min_lon + west_of_max_lon)

match_indexes = (margin_matches == 4).nonzero()

matches = [(A.location[i], B.ID[j]) for i, j in zip(match_indexes[0], match_indexes[1])]
print(matches)
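If you want this as an ID column on A, with NaN for points that fall in no box (as the comments on the question require), a minimal sketch built on the same match matrix (the has_match / first_box names are mine, not from the answer):

inside = margin_matches == 4          # boolean len(A) x len(B) match matrix
has_match = inside.any(axis=1)
first_box = inside.argmax(axis=1)     # column of the first match; only meaningful where has_match
A['ID'] = np.where(has_match, B.ID.to_numpy()[first_box], None)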

PS - You can painlessly run this on a GPU if you use CuPy and replace all references to numpy with cupy.
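For instance, a hypothetical CuPy version of the first few lines (untested):

import cupy as cp

a_lats = cp.asarray(A.latitude.to_numpy()).reshape(-1, 1)
b_min_lats = cp.asarray(B.min_latitude.to_numpy()).reshape(1, -1)
north_of_min_lat = cp.sign(a_lats - b_min_lats)
# ...and so on with np replaced by cp; copy results back with cp.asnumpy()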

Mike Holcomb