I have a Spark dataframe similar to this:

+----+-----+-------+------+------+------+
| cod| name|sum_vol|  date|   lat|   lon|
+----+-----+-------+------+------+------+
|aggc|23124|     37|201610|-15.42|-32.11|
|aggc|23124|     19|201611|-15.42|-32.11|
| abc|  231|     22|201610|-26.42|-43.11|
| abc|  231|     22|201611|-26.42|-43.11|
| ttx|  231|     10|201610|-22.42|-46.11|
| ttx|  231|     10|201611|-22.42|-46.11|
| tty|  231|     25|201610|-25.42|-42.11|
| tty|  231|     45|201611|-25.42|-42.11|
|xptx|  124|     62|201611|-26.43|-43.21|
|xptx|  124|    260|201610|-26.43|-43.21|
|xptx|23124|     50|201610|-26.43|-43.21|
|xptx|23124|     50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+

For each name there are several different lat/lon pairs in the same dataframe. I would like to use shapely to calculate the centroid of each user's points, along the lines of:

Point(lat, lon).centroid

This UDF would be able to calculate it:

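from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType
from shapely.geometry import MultiPoint

def f(x):
    # x is a list of (lat, lon) pairs; the result is the centroid as [lat, lon]
    return list(MultiPoint([tuple(p) for p in x]).centroid.coords[0])

# f returns a list of two floats, so the return type is an array of doubles
get_centroid = udf(f, ArrayType(DoubleType()))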

But how can I apply it to a list of coordinates of each user? It seems that a UDAF on a group by is not a viable solution in this case.

  • Trying to do something similar for grouping events based on how close they occurred geographically, were you able to find a solution? Thanks! – Christa Jan 27 '21 at 07:44

1 Answer

You want to:

  • execute a third-party, plain Python function
  • that is neither associative nor commutative.

The only option you have is to:

  • group the records (you can use RDD.groupBy or collect_list),
  • apply the function to each group,
  • flatMap (RDD) or join (DF) to attach the result to the original records.
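
The DataFrame variant of these three steps could look roughly like the sketch below. It assumes the column names from the question (name, lat, lon) and that lat and lon are numeric columns of the same type. The helper names centroid and with_centroids are made up for illustration, and this is one possible wiring rather than the only one.

```python
from shapely.geometry import MultiPoint


def centroid(points):
    """Return [lat, lon] of the centroid of a sequence of (lat, lon) pairs."""
    c = MultiPoint([tuple(p) for p in points]).centroid
    # Cast to plain Python floats so Spark can serialize the result.
    return [float(c.x), float(c.y)]


def with_centroids(df):
    """Group -> apply -> join (hypothetical helper, columns as in the question)."""
    # Imported here so centroid() can be tried without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DoubleType

    centroid_udf = F.udf(centroid, ArrayType(DoubleType()))

    per_name = (
        df.groupBy("name")
        # 1. group: gather every (lat, lon) pair of a name into one list
        .agg(F.collect_list(F.array("lat", "lon")).alias("points"))
        # 2. apply: run the plain Python function once per group
        .withColumn("centroid", centroid_udf("points"))
        .drop("points")
    )
    # 3. join: attach each name's centroid back to its original rows
    return df.join(per_name, on="name", how="left")
```

Calling with_centroids(df) on the dataframe from the question should add a centroid column holding [lat, lon] for each name. Note that collect_list keeps repeated coordinates, so a location that appears on several dates counts several times in the centroid; swapping in collect_set would count each distinct pair once, if that is what you want.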