I have a DataFrame in PySpark (it comes from reading in a partition with around 1.6 million rows, though I often read in multiple partitions at once).
For each week of data there are ~200,000 distinct timestamps, and for each timestamp there are up to 8 different location IDs (x, y coordinates, not latitude/longitude). In most cases there will be 8 locations, but on rare occasions there might be 7 or 6. The relevant columns are WEEK_NUM, TS, X_COORD and Y_COORD (there are other columns, but those are the ones that matter for this problem). I want to find the total area occupied by the location IDs (the area of a polygon) for each time grouping.

I was thinking I would use a pandas grouped map: group by WEEK_NUM and TS, then apply a pandas UDF that somehow calculates the area of a polygon from the N rows in each group, where each row contributes an (X_COORD, Y_COORD) vertex. But I am not sure whether this approach is sound, how the polygon-area function itself would work, or whether this is efficient. Something like:
```python
df.groupBy('WEEK_NUM', 'TS').applyInPandas(some_function_that_calcs_area_polygon, schema=output_schema)
```
where some_function_that_calcs_area_polygon would receive each group of up to 8 rows and could then maybe use NumPy to get the area? (I understand applyInPandas also requires an output schema, hence the output_schema placeholder; I have sketched a concrete one below.)
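To make the question concrete, here is a rough sketch of what I am imagining. The schema types for WEEK_NUM and TS are guesses on my part, and the angle-sort step assumes the 6-8 points in a group form a convex ring, since the shoelace formula needs the vertices ordered around the perimeter and my rows may arrive in any order:

```python
import numpy as np
import pandas as pd

def some_function_that_calcs_area_polygon(pdf: pd.DataFrame) -> pd.DataFrame:
    # Coordinates for this (WEEK_NUM, TS) group as an (N, 2) array.
    pts = pdf[["X_COORD", "Y_COORD"]].to_numpy(dtype=float)

    # The shoelace formula assumes the vertices are ordered around the
    # perimeter. Sorting by angle around the centroid gives a valid
    # ordering, assuming the points form a convex shape.
    centroid = pts.mean(axis=0)
    angles = np.arctan2(pts[:, 1] - centroid[1], pts[:, 0] - centroid[0])
    pts = pts[np.argsort(angles)]

    # Shoelace formula: area = 0.5 * |sum(x_i * y_{i+1} - x_{i+1} * y_i)|
    x, y = pts[:, 0], pts[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

    # One output row per group, carrying the grouping keys through.
    return pd.DataFrame({
        "WEEK_NUM": [pdf["WEEK_NUM"].iloc[0]],
        "TS": [pdf["TS"].iloc[0]],
        "AREA": [area],
    })

result = df.groupBy("WEEK_NUM", "TS").applyInPandas(
    some_function_that_calcs_area_polygon,
    schema="WEEK_NUM int, TS timestamp, AREA double",  # types are guesses
)
```

As a sanity check, the shoelace step should return 1.0 for the unit-square corners (0,0), (1,0), (1,1), (0,1). What I can't judge is whether the angle sort is safe if the locations can form a concave shape, and whether running a pandas UDF over ~200,000 small groups per week is an efficient way to do this.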