I have a retail dataset in a pyspark.sql DataFrame with many stores, and for each store I have the longitude and latitude. I'm trying to do two things:
- find the 5 nearest-neighbor stores of each store (in a dict or anything else)
- create a column with the distance of each store from a fixed point (a capital city, for example)
The DataFrame looks like:
+----------+-----------+--------+
|x_latitude|y_longitude|id_store|
+----------+-----------+--------+
| 45.116099|   7.712317|     355|
| 45.116099|   7.712317|     355|
| 45.116099|   7.712317|     355|
| 45.116099|   7.712317|     355|
+----------+-----------+--------+
I tried to adapt the haversine Python function to PySpark with a UDF, but I'm stuck on the methodology of how to do it:
from pyspark.sql import functions as F
from pyspark.sql.functions import udf

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great-circle distance in kilometers between two points
    on the earth (here the arguments are Column expressions in decimal degrees).
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(F.radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = F.sin(dlat / 2) ** 2 + F.cos(lat1) * F.cos(lat2) * F.sin(dlon / 2) ** 2
    c = 2 * F.atan2(F.sqrt(a), F.sqrt(1 - a))
    r = 6371  # radius of earth in kilometers
    return c * r
@udf('double')
def closest(v):
    return F.min(lambda p: haversine(v['lat'], v['lon'], p['lat'], p['lon']))
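For the second goal (the distance column from a fixed point), I think I can skip the UDF and call haversine directly as a column expression, passing the fixed coordinates with F.lit(). A minimal sketch of what I mean, where the capital coordinates and the column name dist_to_capital_km are just placeholders I made up:

# placeholder coordinates for the fixed point (a capital); replace with the real ones
capital_lat, capital_lon = 41.9028, 12.4964

df = df.withColumn(
    "dist_to_capital_km",
    haversine(F.col("y_longitude"), F.col("x_latitude"),
              F.lit(capital_lon), F.lit(capital_lat)),
)

I think this avoids the UDF entirely, but I'm not sure it's the idiomatic way.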
For the nearest neighbors I only have the lat/long of each store: lat2/lon2 would be (I think) the lat/long of another store, but I don't know how to iterate over the DataFrame to calculate the distance from each store to all the others.
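The only direction I can think of is a self cross join plus a window function to keep the 5 smallest distances per store, roughly like the sketch below (untested; the names stores_b, lat_b, lon_b, id_store_b, dist_km are arbitrary names I made up, and I suspect the cross join becomes expensive with many stores):

from pyspark.sql.window import Window

# second copy of the stores with renamed columns for the self join
stores_b = df.select(
    F.col("id_store").alias("id_store_b"),
    F.col("x_latitude").alias("lat_b"),
    F.col("y_longitude").alias("lon_b"),
)

# every pair of distinct stores with its distance
pairs = (
    df.crossJoin(stores_b)
      .where(F.col("id_store") != F.col("id_store_b"))
      .withColumn(
          "dist_km",
          haversine(F.col("y_longitude"), F.col("x_latitude"),
                    F.col("lon_b"), F.col("lat_b")),
      )
)

# keep only the 5 closest neighbors of each store
w = Window.partitionBy("id_store").orderBy("dist_km")
nearest_5 = (
    pairs.withColumn("rn", F.row_number().over(w))
         .where(F.col("rn") <= 5)
)

# optionally collect into a dict {id_store: [(id_store_b, dist_km), ...]}
grouped = (
    nearest_5.groupBy("id_store")
             .agg(F.collect_list(F.struct("id_store_b", "dist_km")).alias("nb"))
             .collect()
)
neighbors = {
    row["id_store"]: [(r["id_store_b"], r["dist_km"]) for r in row["nb"]]
    for row in grouped
}

Is that the right approach in PySpark, or is there a better way to do the nearest-neighbor search?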