I am running a groupBy() on a dataset with several million records and want to save the resulting output (a PySpark GroupedData object) so that I can deserialize it later and resume from that point, running aggregations on top of it as needed.
>>> df.groupBy("geo_city")
<pyspark.sql.group.GroupedData at 0x10503c5d0>
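For context, here is a minimal sketch of my pipeline (the paths, session setup and the count() aggregation are illustrative, not my real code; only the geo_city column comes from my data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupby-checkpoint").getOrCreate()

# Several million rows read from storage (path is made up for illustration)
df = spark.read.parquet("/data/events.parquet")

# groupBy() is lazy and only returns a GroupedData handle; no shuffle has run yet
grouped = df.groupBy("geo_city")

# Today I have to run an aggregation in the same job, e.g.:
counts = grouped.count()
counts.write.parquet("/data/geo_city_counts.parquet")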
I want to avoid converting the GroupedData object into a DataFrame or RDD in order to save it to a text file or to Parquet/Avro (since that conversion is expensive). Is there some other efficient way to store the GroupedData object in a binary format for faster reads and writes? Possibly some equivalent of pickle in Spark?
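For what it's worth, the obvious pickle route does not appear to work, since GroupedData is only a thin wrapper around a JVM object (and ultimately references the SparkContext):

import pickle

grouped = df.groupBy("geo_city")

# As far as I can tell this raises an error rather than producing a reusable
# artifact, because the py4j/JVM references inside GroupedData (and the
# SparkContext behind them) are not picklable
blob = pickle.dumps(grouped)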