
I have a dataset with venue_id (about 1,500 of them), physical address, latitude, and longitude.

  • I want to create a column named 'overlap' that counts, for each venue_id, the number of other venue_ids that overlap with it, if any.

  • For example, for venue_id == 1, count every other venue_id whose 2km-radius circle overlaps the 2km-radius circle around venue 1, and save that count in the 'overlap' column. If 2 venue_ids overlap with venue_id == 1, 'overlap' would equal 2.

So far, I have tried visualizing it with 'folium':

import pandas as pd
import folium

m = folium.Map(location=[37.553975551114476, 126.97545224493899],
               zoom_start=10)

# df is assumed to be loaded already, with 'lat' and 'lng' columns

# drop rows with missing values, then confirm none remain
df = df.dropna(how='any')
print(df.isna().sum())


for _, row in df.iterrows():
    folium.Circle(location=[row['lat'], row['lng']],
                        radius=2000).add_to(m)

m.save("index.html")

The problem is that, if I understand correctly, folium's Circle draws its radius in pixels, fixed to the zoom level I selected when creating the base map.

  • My best guess is to use the 'haversine' package, but if there are better ways to do the job, could anyone offer some advice?

P.S. There is no need to actually visualize the result as long as the 2km radius measurements are calculated correctly. I only tried visualizing it through folium to see whether I could 'manually' count the overlapping circles.

Thanks in advance.

Lee Vincent
  • So you don't need the areas, only whether there is an overlap? – Ulises Bussi Oct 26 '21 at 15:15
  • If I read the [Documentation](https://python-visualization.github.io/folium/modules.html#module-folium.vector_layers) correctly, `folium.Circle` takes its radius in meters and draws the circle as a vector, so it should be independent of the zoom level. A quick test on my machine behaves exactly as the documentation describes (for `folium 0.12.1`). It is `folium.CircleMarker` that uses pixels for the radius. – Alfred Rodenboog Oct 26 '21 at 15:41
  • @AlfredRodenboog I read the documentation again, and yes, it is independent of the zoom level, but I am really not sure whether circles with radius=2000 represent 2 km precisely. – Lee Vincent Oct 26 '21 at 23:54
  • @UlisesBussi Sounds like it! Will give it a try asap. Thanks a bunch! – Lee Vincent Oct 26 '21 at 23:56
  • @UlisesBussi yes I just need to count the overlaps per venue_id – Lee Vincent Oct 26 '21 at 23:56

1 Answer


It sounds like the goal here is just to determine, for each point, how many other points in your dataset are within 2km of it. The Haversine distance is the way to go in this case. Since you're only interested in a short distance and you have a relatively small number of points, this answer (https://stackoverflow.com/a/29546836/4325492) provides the central function. Then it's just a matter of applying it to your data. Here's one approach to do that:

import pandas as pd
import numpy as np

# function from https://stackoverflow.com/a/29546836/4325492
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    Args may be scalars or arrays that broadcast against each other.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c  # 6367 is an approximate Earth radius in km
    return km

# generate some sample data
lng1, lat1 = np.random.randn(2, 1000)
df = pd.DataFrame(data={'lng':lng1, 'lat':lat1})

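# Apply to the data: select the coordinates by column name (not by position),
# so extra columns such as venue_id or address are never passed to the function
df['overlap'] = df.apply(lambda row: (haversine_np(row['lng'], row['lat'], df['lng'], df['lat']) <= 2).sum() - 1, axis=1)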

When applying the function, just count the number of points whose distance from the current row is <= 2km. We subtract 1 because each row is also compared against itself, and every point is 0km from itself.
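As an alternative to the row-wise `apply`, the same count can be sketched with NumPy broadcasting, which computes all pairwise distances in one pass. This is only a sketch: the helper name `count_neighbors`, its `max_km` parameter, and the four sample venues are made up for illustration. It builds an N x N distance matrix, which should be fine for about 1,500 points but will not scale to very large datasets. The threshold is left as a parameter because two circles of radius 2km overlap when their centers are less than 4km apart, so depending on what 'overlap' means you may want `max_km=4.0`.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6367  # same value as haversine_np above; 6371 is also commonly used


def count_neighbors(df, max_km=2.0, lat_col="lat", lng_col="lng"):
    """Hypothetical helper: for each row, count the other rows within max_km."""
    lat = np.radians(df[lat_col].to_numpy(dtype=float))
    lng = np.radians(df[lng_col].to_numpy(dtype=float))

    # Column vector minus row vector broadcasts to an N x N grid of differences
    dlat = lat[:, None] - lat[None, :]
    dlng = lng[:, None] - lng[None, :]

    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2.0) ** 2)
    # Clip guards against tiny floating point excursions outside [0, 1]
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

    within = dist_km <= max_km
    np.fill_diagonal(within, False)  # a point is not its own neighbor
    return pd.Series(within.sum(axis=1), index=df.index)


# Made-up sample: three venues spaced along one latitude, one far away
sample = pd.DataFrame({
    "venue_id": [1, 2, 3, 4],
    "lat": [37.5540, 37.5540, 37.5540, 37.9000],
    "lng": [126.9755, 126.9900, 127.0100, 126.9755],
})
sample["overlap_2km"] = count_neighbors(sample, max_km=2.0)  # centers within 2 km
sample["overlap_4km"] = count_neighbors(sample, max_km=4.0)  # 2 km circles intersect
print(sample)
```

In this sample, venues 1 and 3 are roughly 3km apart, so they count as neighbors only under the 4km threshold.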

Brendan A.
  • Nice solution, I like the way you vectorized using a custom haversine function. I was having issues with that using the haversine package. Anyway, I think the distance threshold between points should be 4 km, as the OP was interested in overlapping circles of 2 km radius. – Alfred Rodenboog Oct 26 '21 at 15:43
  • @Brendan A. It is definitely what I was looking for and works just fine with the sample data. But I get a weird error message when I try it with my dataframe saying "TypeError: ufunc 'radians' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''". Once I get this part solved I think it will do the trick just fine. – Lee Vincent Oct 27 '21 at 00:16
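
A TypeError of this kind usually means that `np.radians` received non-numeric values, for example a text column such as the address, or coordinates that were read in as strings. The sketch below assumes the second case; the `raw` frame and its values are invented for illustration. It converts the coordinate columns with `pd.to_numeric` and drops rows that cannot be parsed. Selecting `lat` and `lng` by name, rather than by position, also keeps other columns out of the distance calculation.

```python
import pandas as pd

# Hypothetical frame mimicking coordinates that were loaded as text
raw = pd.DataFrame({
    "venue_id": [1, 2, 3],
    "lat": ["37.5540", "37.5601", "n/a"],
    "lng": ["126.9755", "126.9820", "127.0100"],
})

print(raw.dtypes)  # 'lat' and 'lng' are not numeric here

clean = raw.copy()
for col in ["lat", "lng"]:
    # errors="coerce" turns unparseable entries into NaN instead of raising
    clean[col] = pd.to_numeric(clean[col], errors="coerce")

# Drop rows whose coordinates could not be parsed
clean = clean.dropna(subset=["lat", "lng"]).reset_index(drop=True)
print(clean.dtypes)
```

After this step the `lat` and `lng` columns are floating point, so they can be passed to a NumPy-based distance function.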