I have a pandas dataframe A of approximately 300000 rows. Each row has a latitude and longitude value.
I also have a second pandas dataframe B of about 10000 rows, which has an ID number, a maximum and minimum latitude, and a maximum and minimum longitude.
For each row in A, I need the ID of the corresponding row in B, such that the latitude and longitude of the row in A is contained within the bounding box represented by the row in B.
So far I have the following:
ID_list = []
for index, row in A.iterrows():
filtered_B = B.apply(lambda x : x['ID'] if row['latitude'] >= x['min_latitude']
and row['latitude'] < x['max_latitude'] \
and row['longitude'] >= x['min_longitude'] \
and row['longitude'] < x['max_longitude'] \
else None, axis = 1)
ID_list.append(B.loc[filtered_B == True]['ID']
The ID_list variable was created with the intention of adding it as an ID column to A. The greater than or equal to and less than conditions are included so that each row in A has only one ID from B.
The above code technically works, but it completes about 1000 rows per minute, which is just not feasible for such a large dataset.
Any tips would be appreciated, thank you.
edit: sample dataframes:
A:
location | latitude | longitude |
---|---|---|
1 | -33.81263 | 151.23691 |
2 | -33.994823 | 151.161274 |
3 | -33.320154 | 151.662009 |
4 | -33.99019 | 151.1567332 |
B:
ID | min_latitude | max_latitude | min_longitude | max_longitude |
---|---|---|---|---|
9ae8704 | -33.815 | -33.810 | 151.234 | 151.237 |
2ju1423 | -33.555 | -33.543 | 151.948 | 151.957 |
3ef4522 | -33.321 | -33.320 | 151.655 | 151.668 |
0uh0478 | -33.996 | -33.990 | 151.152 | 151.182 |
expected output:
ID_list = [9ae8704, 0uh0478, 3ef4522, 0uh0478]