I have a dataframe similar to
+----+-----+-------+------+------+------+
| cod| name|sum_vol|  date|   lat|   lon|
+----+-----+-------+------+------+------+
|aggc|23124|     37|201610|-15.42|-32.11|
|aggc|23124|     19|201611|-15.42|-32.11|
| abc|  231|     22|201610|-26.42|-43.11|
| abc|  231|     22|201611|-26.42|-43.11|
| ttx|  231|     10|201610|-22.42|-46.11|
| ttx|  231|     10|201611|-22.42|-46.11|
| tty|  231|     25|201610|-25.42|-42.11|
| tty|  231|     45|201611|-25.42|-42.11|
|xptx|  124|     62|201611|-26.43|-43.21|
|xptx|  124|    260|201610|-26.43|-43.21|
|xptx|23124|     50|201610|-26.43|-43.21|
|xptx|23124|     50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
For each name there are several different lat/lon pairs in the same dataframe. I would like to use shapely
to calculate the centroid of those points for each user (that is, for each name):
Point(lat, lon).centroid
This UDF would be able to calculate it:
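from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
from shapely.geometry import MultiPoint
def f(x):
    # x is expected to expose .values (rows of coordinate pairs)
    return list(MultiPoint(tuple(x.values)).centroid.coords[0])
# f returns two coordinates, so the return type is an array of doubles
get_centroid = udf(f, ArrayType(DoubleType()))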
But how can I apply it to a list of coordinates of each user? It seems that a UDAF on a group by is not a viable solution in this case.
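To make the expected result concrete, here is a minimal plain-Python sketch (no Spark involved) of the output I am after for the sample rows above. It assumes that the centroid of a set of points is the arithmetic mean of their coordinates, and that each distinct location should count once per name. The helper name centroids_by_name and the rows list are made up for this illustration.

```python
from collections import defaultdict

# (name, lat, lon) taken from the sample rows above,
# one entry per distinct location of each name (assumption: duplicates count once)
rows = [
    (23124, -15.42, -32.11),
    (231, -26.42, -43.11),
    (231, -22.42, -46.11),
    (231, -25.42, -42.11),
    (124, -26.43, -43.21),
    (23124, -26.43, -43.21),
]


def centroids_by_name(rows):
    """Group (lat, lon) pairs by name and average each group (hypothetical helper)."""
    groups = defaultdict(list)
    for name, lat, lon in rows:
        groups[name].append((lat, lon))
    # The centroid of a group is taken to be the mean lat and the mean lon
    return {
        name: (
            sum(lat for lat, _ in pts) / len(pts),
            sum(lon for _, lon in pts) / len(pts),
        )
        for name, pts in groups.items()
    }


centroids = centroids_by_name(rows)
for name, (lat, lon) in sorted(centroids.items()):
    print(name, round(lat, 4), round(lon, 4))
```

What I am looking for is the Spark equivalent of this: one row per name, with the centroid computed from all coordinates belonging to that name.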