
I have a retail dataset in a pyspark.sql DataFrame with many stores, and for each store I have the longitude and latitude. I'm trying to do two things:

  • find the 5 nearest-neighbor stores of each store (in a dict or any other structure)
  • create a column with the distance of each store from a fixed point (a capital city, for example)

The dataframe looks like this:

+----------+-----------+--------+
|x_latitude|y_longitude|id_store|
+----------+-----------+--------+
| 45.116099|   7.712317|     355|
| 45.116099|   7.712317|     355|
| 45.116099|   7.712317|     355|
| 45.116099|   7.712317|     355|
+----------+-----------+--------+

I tried to adapt a haversine Python function to PySpark with a udf, but I'm stuck on the methodology:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great-circle distance in kilometers between two
    points on the earth, given as decimal-degree columns.
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(F.radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = F.sin(dlat / 2) ** 2 + F.cos(lat1) * F.cos(lat2) * F.sin(dlon / 2) ** 2
    c = 2 * F.atan2(F.sqrt(a), F.sqrt(1 - a))
    r = 6371  # radius of earth in kilometers
    return c * r

# my non-working attempt -- F.* column functions can't be called inside a
# udf, and I don't know how to compare each row against all the others
@udf('double')
def closest(v):
    return F.min(lambda p: haversine(v['lat'], v['lon'], p['lat'], p['lon']))

The problem is that I only have the lat/long of each store; lat2/lon2 would be (I think) the lat/long of another store, but I don't know how to iterate over the dataframe to calculate the distance from a fixed store to all the rest.
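For the second bullet point, no udf is needed at all, since the formula is pure column arithmetic against a constant. As a sanity check on the formula itself, here is the same haversine computation in plain Python with the `math` module, measuring the example store's distance from a fixed capital; the coordinates for Rome are my own assumption, used only for illustration:

```python
from math import radians, sin, cos, atan2, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points
    given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return 6371 * c  # earth radius in kilometers

# store 355 from the example table, against Rome (approx. 41.9028 N, 12.4964 E)
d = haversine_km(7.712317, 45.116099, 12.4964, 41.9028)
print(round(d, 1))
```

In PySpark the fixed point would go in as `F.lit(...)` columns so the whole thing stays a `withColumn` expression, with no udf overhead.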

user9176398
  • For the first bullet point, see [this post](https://stackoverflow.com/questions/48174484/new-dataframe-column-as-a-generic-function-of-other-rows-spark). You're going to have to do a cartesian product. It's going to be slow. The second bullet point should be fairly easy to implement, and you can do it [without using a udf](https://stackoverflow.com/a/38297050/5858851). – pault Oct 11 '18 at 22:00
  • @user9176398, the link https://stackoverflow.com/questions/38994903/how-to-sum-distances-between-data-points-in-a-dataset-using-pyspark might be helpful; the distance function will be the haversine you created in PySpark. – Devarshi Mandal Jul 13 '19 at 23:35
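Following the cartesian-product idea from the first comment: in PySpark this would be a `crossJoin` of the dataframe with itself, filtering out the self-pairs and keeping the k smallest distances per store. The logic is easier to see in plain Python on a small list of stores; the coordinates below are invented sample data, not from the real dataset:

```python
from math import radians, sin, cos, atan2, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two (lon, lat) points."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * atan2(sqrt(a), sqrt(1 - a))

# (id_store, x_latitude, y_longitude) -- invented sample data
stores = [
    (355, 45.116099, 7.712317),
    (356, 45.070300, 7.686900),
    (357, 45.464200, 9.190000),
    (358, 44.405650, 8.946256),
]

def nearest_neighbors(stores, k=5):
    """Brute-force k nearest stores for each store: compare every store
    against every other one (the cartesian product), sort by distance,
    keep the first k ids."""
    result = {}
    for sid, lat, lon in stores:
        dists = [
            (other_id, haversine_km(lon, lat, other_lon, other_lat))
            for other_id, other_lat, other_lon in stores
            if other_id != sid
        ]
        dists.sort(key=lambda t: t[1])
        result[sid] = [other_id for other_id, _ in dists[:k]]
    return result

print(nearest_neighbors(stores, k=2))
```

The PySpark equivalent replaces the Python loops with `df.crossJoin(df.select(...))`, the column-based `haversine` from the question for the distance, and a `Window.partitionBy("id_store").orderBy("dist")` plus `F.row_number() <= k` to keep the k closest rows, so no udf is required there either.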

0 Answers