
I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.

I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.

For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.

So far I have the following:

ID_list = []

for index, row in A.iterrows():
    filtered_B = B.apply(lambda x: x['ID'] if row['latitude'] >= x['min_latitude']
                                           and row['latitude'] < x['max_latitude']
                                           and row['longitude'] >= x['min_longitude']
                                           and row['longitude'] < x['max_longitude']
                                           else None, axis=1)
    ID_list.append(B.loc[filtered_B.notna(), 'ID'])

The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.

The above code technically works, but it processes only about 1,000 rows per minute, which is not feasible for such a large dataset.

Any tips would be appreciated, thank you.

edit: sample dataframes:

A:

location    latitude     longitude
1           -33.81263    151.23691
2           -33.994823   151.161274
3           -33.320154   151.662009
4           -33.99019    151.1567332

B:

ID        min_latitude  max_latitude  min_longitude  max_longitude
9ae8704   -33.815       -33.810       151.234        151.237
2ju1423   -33.555       -33.543       151.948        151.957
3ef4522   -33.321       -33.320       151.655        151.668
0uh0478   -33.996       -33.990       151.152        151.182

expected output:

ID_list = ['9ae8704', '0uh0478', '3ef4522', '0uh0478']
  • kindly share sample dataframes, with the expected output. – sammywemmy Oct 11 '21 at 06:48
  • You did not add the expected output dataframe – sammywemmy Oct 11 '21 at 07:55
  • Will the values in A definitely be found in B? If not, what should be the output for that row in A? And could there be multiple matches? – EBDS Oct 13 '21 at 15:01
  • If not found, fill with NaN. And there will only be one match. – ruby Oct 14 '21 at 11:31
  • Try np.vectorize or the other alternatives mentioned here instead of `apply` and see if it is still slow. https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c – Emma Oct 14 '21 at 18:37
  • Do you need a Python-only solution? If you don't, I would recommend MongoDB: use table B to create documents with geo shapes, and query them with the points from A through the MongoDB Python library `pymongo`. A local MongoDB will do, you don't have that much data. Hope it helps; it is much better suited to your use case than geopandas or any other Python library. – Clem G. Oct 19 '21 at 14:57

6 Answers

3

I would use geopandas to do this; its spatial join makes use of an R-tree spatial index.

import geopandas as gpd
from shapely.geometry import box

a_gdf = gpd.GeoDataFrame(a[['location']], geometry=gpd.points_from_xy(a.longitude,
                                                                      a.latitude))
b_gdf = gpd.GeoDataFrame(
    b[['ID']], 
    geometry=[box(*bounds) for _, bounds in b.loc[:, ['min_longitude',
                                                      'min_latitude', 
                                                      'max_longitude', 
                                                      'max_latitude']].iterrows()])

gpd.sjoin(a_gdf, b_gdf)

Output:

   location                                 geometry  index_right       ID
0         1              POINT (151.23691 -33.81263)            0  9ae8704
1         2            POINT (151.161274 -33.994823)            3  0uh0478
3         4   POINT (151.1567332 -33.99019000000001)            3  0uh0478
2         3            POINT (151.662009 -33.320154)            2  3ef4522
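If you need the IDs back on a itself, with NaN where a point falls in no box (the behaviour requested in the comments), a left spatial join keeps the unmatched points. A minimal sketch, assuming each point matches at most one box so the output has one row per point; note also that sjoin's default 'intersects' predicate treats box edges as inclusive, unlike the question's >= / < convention:

joined = gpd.sjoin(a_gdf, b_gdf, how='left')  # unmatched points get NaN in 'ID'
a['ID'] = joined['ID'].to_numpy()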
2

We can create a multi-interval index on b and then use regular .loc indexing into it with (latitude, longitude) tuples from the rows of a. Interval indexes are useful in situations like this, where we have a table of low and high values to bucket a variable into.

from io import StringIO

import pandas as pd

a = pd.read_table(StringIO("""
    location    latitude    longitude
    1   -33.81263   151.23691
    2   -33.994823  151.161274
    3   -33.320154  151.662009
    4   -33.99019   151.1567332
"""), sep='\s+')

b = pd.read_table(StringIO("""
    ID  min_latitude    max_latitude    min_longitude   max_longitude
    9ae8704     -33.815     -33.810     151.234     151.237
    2ju1423     -33.555     -33.543     151.948     151.957
    3ef4522     -33.321     -33.320     151.655     151.668
    0uh0478     -33.996     -33.990     151.152     151.182
"""), sep='\s+')

lat_index = pd.IntervalIndex.from_arrays(b['min_latitude'], b['max_latitude'], closed='left')
lon_index = pd.IntervalIndex.from_arrays(b['min_longitude'], b['max_longitude'], closed='left')
index = pd.MultiIndex.from_tuples(list(zip(lat_index, lon_index)), names=['lat_range', 'lon_range'])
b = b.set_index(index)

print(b.loc[list(zip(a.latitude, a.longitude)), 'ID'].tolist())

The above will even handle rows of a that have no corresponding row in b by gracefully filling in those values with NaN. (Newer pandas versions raise a KeyError when .loc is given list-like labels missing from the index, so this graceful fill depends on your pandas version.)
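If your pandas version does raise on missing labels, IntervalIndex.get_indexer is an alternative: it returns the position of the interval containing each value, or -1 where there is none. A minimal sketch (the lat_pos / lon_pos / hit names are mine, not from the answer), assuming additionally that the latitude intervals do not overlap one another, and likewise the longitude intervals:

import numpy as np

lat_pos = lat_index.get_indexer(a['latitude'])
lon_pos = lon_index.get_indexer(a['longitude'])
# A point only matches if the same row of b contains it on both axes.
hit = (lat_pos != -1) & (lat_pos == lon_pos)
a['ID'] = np.where(hit, b['ID'].to_numpy()[lat_pos], None)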

Kyle Parsons
  • Interval indexes are fast and the right approach; however, if the intervals are overlapping, your solution won't suffice – sammywemmy Oct 19 '21 at 06:01
  • Overlapping intervals do cause unexpected behavior, but the comments to the question clarify that the intervals will not overlap. – Kyle Parsons Oct 19 '21 at 13:59
1

A good option for this might be to perform a cross merge and then filter out the rows that don't qualify. For example, you might do:

AB_cross = A.merge(
    B,
    how="cross"
)

Now we have a giant dataframe pairing every row of A with every row of B; we don't yet know which pairs actually qualify. The merge itself is fast, but it creates a very large dataset in memory: 300,000 × 10,000 = 3 billion rows, so at full size you may need to process A in chunks.

Now, we need to apply our logic by filtering the dataset accordingly. The comparisons are vectorized through numpy, so they're very fast. Using between also makes the code a bit more readable.

Note that below I use .between(inclusive = 'left') to encode min_longitude <= longitude < max_longitude (the inequality is inclusive only on the left end of the interval), matching the question's >= / < conditions; the string form of inclusive requires pandas 1.3+.

ID_list = AB_cross['ID'].loc[
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive = 'left') &
    AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive = 'left')
]
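The above still drops rows of A that match no box, which should end up as NaN according to the comments. A minimal sketch of mapping the matches back onto A (assuming, per the comments, at most one match per point, and that location uniquely identifies rows of A):

mask = (
    AB_cross['longitude'].between(AB_cross['min_longitude'], AB_cross['max_longitude'], inclusive='left')
    & AB_cross['latitude'].between(AB_cross['min_latitude'], AB_cross['max_latitude'], inclusive='left')
)
# Left-merge so rows of A with no matching box get NaN in the ID column.
A_with_ID = A.merge(AB_cross.loc[mask, ['location', 'ID']], on='location', how='left')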
1

A reasonably fast approach could be to pre-sort the points by latitude and by longitude, then iterate over the boxes, finding the points inside each box by latitude (min_latitude <= lat < max_latitude) and by longitude (min_longitude <= lon < max_longitude) separately with np.searchsorted, and then intersecting the two index sets with np.intersect1d.

For 300K points and 10K non-overlapping boxes in my tests it took less than 10 seconds to run.

Here's an example implementation:

import numpy as np
import pandas as pd

# create `ids` series to be populated with box IDs for each point
ids = pd.Series(np.nan, index=a.index)

# create series with points sorted by lats and lons
lats = a['latitude'].sort_values()
lons = a['longitude'].sort_values()

# iterate over boxes
for bi, r in b.set_index('ID').iterrows():
    # find points inside the box by latitude:
    i1, i2 = np.searchsorted(lats, [r['min_latitude'], r['max_latitude']])
    ix_lat = lats.index[i1:i2]
    
    # find points inside the box by longitude:
    j1, j2 = np.searchsorted(lons, [r['min_longitude'], r['max_longitude']])
    ix_lon = lons.index[j1:j2]
    
    # find points inside the box as intersection and set values in ids:
    ix = np.intersect1d(ix_lat, ix_lon)
    ids.loc[ix] = bi

ids.tolist()

Output (on provided sample data):

['9ae8704', '0uh0478', '3ef4522', '0uh0478']
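To sanity-check the stated timing at full size, you can generate synthetic data of the same shape. A hypothetical generator (not the answerer's actual benchmark code): 10K disjoint unit boxes on a 100x100 grid and 300K random points:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ii, jj = np.meshgrid(np.arange(100), np.arange(100))
ii, jj = ii.ravel(), jj.ravel()
b = pd.DataFrame({
    'ID': ['box_%d_%d' % (i, j) for i, j in zip(ii, jj)],
    'min_latitude': ii, 'max_latitude': ii + 1,
    'min_longitude': jj, 'max_longitude': jj + 1,
})
a = pd.DataFrame({
    'latitude': rng.uniform(0, 100, 300_000),
    'longitude': rng.uniform(0, 100, 300_000),
})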
perl
0

Your code consumes: [timing screenshot]

and this code

import pandas as pd

A = pd.DataFrame.from_dict(
    {'location': {0: 1, 1: 2, 2: 3, 3: 4},
     'latitude': {0: -33.81263, 1: -33.994823, 2: -33.320154, 3: -33.99019},
     'longitude': {0: 151.23691, 1: 151.161274, 2: 151.662009, 3: 151.1567332}}
)
B = pd.DataFrame.from_dict(
    {'ID': {0: '9ae8704', 1: '2ju1423', 2: '3ef4522', 3: '0uh0478'},
     'min_latitude': {0: -33.815, 1: -33.555, 2: -33.321, 3: -33.996},
     'max_latitude': {0: -33.81, 1: -33.543, 2: -33.32, 3: -33.99},
     'min_longitude': {0: 151.234, 1: 151.948, 2: 151.655, 3: 151.152},
     'max_longitude': {0: 151.237, 1: 151.957, 2: 151.668, 3: 151.182}}
)


def func(latitude, longitude):
    # Return the ID of the first box containing the point, or None if there is none.
    for _, x in B.iterrows():
        if (x.min_latitude <= latitude < x.max_latitude
                and x.min_longitude <= longitude < x.max_longitude):
            return x['ID']


A.apply(lambda x: func(x.latitude, x.longitude), axis=1).to_list()

consumes: [timing screenshot]

so the new solution is about 2.33 times faster.
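A comment on the question suggested np.vectorize as an alternative to apply. For completeness, a sketch of that variant (np.vectorize is a convenience wrapper around a Python loop, so don't expect a large speedup; timing not verified here):

import numpy as np

# otypes=[object] lets func return string IDs or None.
vfunc = np.vectorize(func, otypes=[object])
A['ID'] = vfunc(A['latitude'].to_numpy(), A['longitude'].to_numpy())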

Amir saleem
0

I think Geopandas would be the best solution for making sure all of your edge cases are covered, like meridian/equator crossovers and the like; spatial queries are exactly what Geopandas is designed for. It can be a pain to install, though.

One naive approach in numpy (assuming the coordinates are well-behaved, i.e. no boxes wrapping across the antimeridian) would be to calculate each of your four clauses as the sign of a difference, broadcast over every point/box pair, and then keep only the pairs where all the signs match your criteria.

For lots of intensive, repetitive calculations like this, it's usually better to dump the data out of pandas into numpy, process it there, and then put it back into pandas. Note that the intermediate matrices below are len(A) × len(B), so at 300,000 × 10,000 you would likely need to process A in chunks.

import numpy as np

a_lats = A.latitude.values.reshape(-1, 1)
b_min_lats = B.min_latitude.values.reshape(1, -1)
b_max_lats = B.max_latitude.values.reshape(1, -1)

a_lons = A.longitude.values.reshape(-1, 1)
b_min_lons = B.min_longitude.values.reshape(1, -1)
b_max_lons = B.max_longitude.values.reshape(1, -1)

north_of_min_lat = np.sign(a_lats - b_min_lats)   # +1 where lat > min_latitude
south_of_max_lat = np.sign(b_max_lats - a_lats)   # +1 where lat < max_latitude

east_of_min_lon = np.sign(a_lons - b_min_lons)    # +1 where lon > min_longitude
west_of_max_lon = np.sign(b_max_lons - a_lons)    # +1 where lon < max_longitude

# All four signs are +1 only for points strictly inside a box (np.sign is 0
# on a boundary, so points exactly on a min edge won't match, unlike the
# question's >= condition).
margin_matches = (north_of_min_lat + south_of_max_lat + east_of_min_lon + west_of_max_lon)

match_indexes = (margin_matches == 4).nonzero()

matches = [(A.location[i], B.ID[j]) for i, j in zip(match_indexes[0], match_indexes[1])]
print(matches)
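If you want this as an ID column on A, with NaN for points that fall in no box (as the comments on the question require), a minimal sketch built on the same match matrix (the has_match / first_box names are mine, not from the answer):

inside = margin_matches == 4          # boolean len(A) x len(B) match matrix
has_match = inside.any(axis=1)
first_box = inside.argmax(axis=1)     # column of the first match; only meaningful where has_match
A['ID'] = np.where(has_match, B.ID.to_numpy()[first_box], None)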

PS - You can painlessly run this on a GPU if you use CuPy and replace all references to numpy with cupy.
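For instance, a hypothetical CuPy version of the first few lines (untested):

import cupy as cp

a_lats = cp.asarray(A.latitude.to_numpy()).reshape(-1, 1)
b_min_lats = cp.asarray(B.min_latitude.to_numpy()).reshape(1, -1)
north_of_min_lat = cp.sign(a_lats - b_min_lats)
# ...and so on with np replaced by cp; copy results back with cp.asnumpy()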

Mike Holcomb