I have a situation where I have a very large dataframe of airplane location data with 5 rows per timestamp, one row per plane (see the table below for an example of 1 timestamp... 3 of the rows just have dots, but imagine they have plane_nums and locations too)
+---------------------------+-----------+-----------+-----------+-----------+
| utc_time                  | plane_num | lat       | lon       | height    |
+---------------------------+-----------+-----------+-----------+-----------+
| 2021-06-02T05:01:40+00:00 |         1 | 51.759014 | -1.256688 | 47.337597 |
| 2021-06-02T05:01:40+00:00 |         2 | 41.758014 |  1.346678 | 41.632522 |
| ...                       | ...       | ...       | ...       | ...       |
| ...                       | ...       | ...       | ...       | ...       |
| ...                       | ...       | ...       | ...       | ...       |
+---------------------------+-----------+-----------+-----------+-----------+
For each timestamp, I want to collapse this dataframe down to 1 row that contains all of the planes' lat/lon values plus various distance calculations between pairs of planes.
I know I could easily accomplish this in pyspark by filtering the dataframe down to each plane (there will always be a plane 1, 2, 3, 4, 5 per timestamp). See code below which creates a dataframe for 1 timestamp, but imagine there are millions of timestamps.
from pyspark.sql import functions as F

dfAirplaneData = spark.createDataFrame([
    ("2021-06-02T05:01:40+00:00", 1, 51.759014, -1.256688, 47.337597),
    ("2021-06-02T05:01:40+00:00", 2, 41.758014, 1.346678, 41.632522),
    ("2021-06-02T05:01:40+00:00", 3, 41.758014, 1.346678, 11.632522),
    ("2021-06-02T05:01:40+00:00", 4, 41.758014, 1.346678, 21.632522),
    ("2021-06-02T05:01:40+00:00", 5, 11.758014, 1.346678, 41.632522)
], ("utc_time", "plane_num", "lat", "lon", "height"))

dfPlane1 = dfAirplaneData.filter(F.col('plane_num') == 1)
dfPlane2 = dfAirplaneData.filter(F.col('plane_num') == 2)
dfPlane3 = dfAirplaneData.filter(F.col('plane_num') == 3)
I could then join the dataframe back to itself multiple times to get one row per timestamp, but this feels inefficient. Is there a better way to do this in pyspark (perhaps using .groupBy with the timestamps)?
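To make that concrete, the self-join version I have in mind looks roughly like this (the column renames and the simple placeholder distance stand in for my real calculations):
# Rough sketch of the self-join approach; dist_1_2 is a placeholder for my real distance calculation
dfPlane1Renamed = dfPlane1.select(
    'utc_time',
    F.col('lat').alias('lat_1'),
    F.col('lon').alias('lon_1'))
dfPlane2Renamed = dfPlane2.select(
    'utc_time',
    F.col('lat').alias('lat_2'),
    F.col('lon').alias('lon_2'))

dfJoined = (dfPlane1Renamed
    .join(dfPlane2Renamed, on='utc_time')  # one join per additional plane
    .withColumn('dist_1_2',
                F.sqrt(F.pow(F.col('lat_1') - F.col('lat_2'), 2)
                       + F.pow(F.col('lon_1') - F.col('lon_2'), 2))))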
I am aware that instead of joining, I could use:
w = Window.partitionBy('utc_time')
but I only need 1 row per group, so I would end up doing the calculation for every row (duplicating the work) and then filtering down to 1 summary row. Also, I would need to know which plane_num each value belongs to when doing the comparisons, so I was unsure whether that is easily doable with Window.partitionBy.
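Coming back to the .groupBy idea, the only no-join version I have come up with is conditional aggregation to pivot each plane's columns out, something like the sketch below (again with a placeholder distance), though I don't know whether it is actually any more efficient:
# Sketch: pivot the 5 planes' columns out with conditional aggregation, then compute distances
dfOneRowPerTime = (dfAirplaneData
    .groupBy('utc_time')
    .agg(
        F.first(F.when(F.col('plane_num') == 1, F.col('lat')), ignorenulls=True).alias('lat_1'),
        F.first(F.when(F.col('plane_num') == 1, F.col('lon')), ignorenulls=True).alias('lon_1'),
        F.first(F.when(F.col('plane_num') == 2, F.col('lat')), ignorenulls=True).alias('lat_2'),
        F.first(F.when(F.col('plane_num') == 2, F.col('lon')), ignorenulls=True).alias('lon_2'))
    # ...same pattern for planes 3-5 would go here...
    .withColumn('dist_1_2',  # placeholder for the real distance calculation
                F.sqrt(F.pow(F.col('lat_1') - F.col('lat_2'), 2)
                       + F.pow(F.col('lon_1') - F.col('lon_2'), 2))))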
I have briefly read about Pandas UDF grouped maps, and I believe I could use a grouped map to process each timestamp's rows as a pandas dataframe if that is the best approach.
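If a grouped map is the way to go, I assume it would look something like this with applyInPandas (the output schema and the distance calculation are just placeholders):
import pandas as pd

# Placeholder output schema / distance just to illustrate the grouped-map idea
outputSchema = "utc_time string, lat_1 double, lon_1 double, lat_2 double, lon_2 double, dist_1_2 double"

def collapse_timestamp(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf contains the 5 rows for one utc_time; index them by plane_num
    planes = pdf.set_index('plane_num')
    dist_1_2 = ((planes.loc[1, 'lat'] - planes.loc[2, 'lat']) ** 2
                + (planes.loc[1, 'lon'] - planes.loc[2, 'lon']) ** 2) ** 0.5  # placeholder
    return pd.DataFrame([{
        'utc_time': pdf['utc_time'].iloc[0],
        'lat_1': planes.loc[1, 'lat'], 'lon_1': planes.loc[1, 'lon'],
        'lat_2': planes.loc[2, 'lat'], 'lon_2': planes.loc[2, 'lon'],
        'dist_1_2': dist_1_2}])

dfCollapsed = dfAirplaneData.groupBy('utc_time').applyInPandas(collapse_timestamp, schema=outputSchema)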
I'm interested in the most efficient, simple way to solve this. If joining the dataframe to itself is the best option, then I can use that approach.