I have a DataFrame in PySpark (it comes from reading in a partition with around 1.6 million rows, though I often read in multiple partitions at once).
For each week of data there are ~200,000 distinct timestamps, and for each timestamp there are up to 8 different location IDs (x, y coordinates, not latitude/longitude). In most cases there will be 8 locations, but on rare occasions there might be 7 or 6. The relevant columns are WEEK_NUM, TS, X_COORD and Y_COORD (there are other columns, but those are the ones that matter for this problem). I want to find the total area occupied by the location IDs (the area of a polygon) for each time grouping.

I was thinking I would use a pandas grouped map: group by WEEK_NUM and TS, then apply a pandas UDF that somehow calculates the area of a polygon from the N rows in each group, where each row contributes an (X_COORD, Y_COORD) vertex. But I am not sure whether this approach is sound, how the polygon-area function itself would work, or whether this is efficient. Something like:
```python
df.groupBy('WEEK_NUM', 'TS').applyInPandas(some_function_that_calcs_area_polygon, schema=output_schema)
```
where some_function_that_calcs_area_polygon would receive each group of up to 8 rows and could then maybe use NumPy to get the area? (I understand applyInPandas also requires an output schema, hence the output_schema placeholder; I have sketched a concrete one below.)
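To make the question concrete, here is a rough sketch of what I am imagining. The schema types for WEEK_NUM and TS are guesses on my part, and the angle-sort step assumes the 6-8 points in a group form a convex ring, since the shoelace formula needs the vertices ordered around the perimeter and my rows may arrive in any order:

```python
import numpy as np
import pandas as pd

def some_function_that_calcs_area_polygon(pdf: pd.DataFrame) -> pd.DataFrame:
    # Coordinates for this (WEEK_NUM, TS) group as an (N, 2) array.
    pts = pdf[["X_COORD", "Y_COORD"]].to_numpy(dtype=float)

    # The shoelace formula assumes the vertices are ordered around the
    # perimeter. Sorting by angle around the centroid gives a valid
    # ordering, assuming the points form a convex shape.
    centroid = pts.mean(axis=0)
    angles = np.arctan2(pts[:, 1] - centroid[1], pts[:, 0] - centroid[0])
    pts = pts[np.argsort(angles)]

    # Shoelace formula: area = 0.5 * |sum(x_i * y_{i+1} - x_{i+1} * y_i)|
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    # One output row per group, carrying the grouping keys through.
    return pd.DataFrame({
        "WEEK_NUM": [pdf["WEEK_NUM"].iloc[0]],
        "TS": [pdf["TS"].iloc[0]],
        "AREA": [area],
    })

result = df.groupBy("WEEK_NUM", "TS").applyInPandas(
    some_function_that_calcs_area_polygon,
    schema="WEEK_NUM int, TS timestamp, AREA double",  # types are guesses
)
```

As a sanity check, the shoelace step should return 1.0 for the unit-square corners (0,0), (1,0), (1,1), (0,1). What I can't judge is whether the angle sort is safe if the locations can form a concave shape, and whether running a pandas UDF over ~200,000 small groups per week is an efficient way to do this.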