I am doing a data cleaning job with very simple logic.
val inputData = spark.read.parquet(inputDataPath)
val viewMiddleTable = inputData.where($"type" === "0").select($"field1", $"field2", $"field3")
  .groupBy($"field1", $"field2", $"field3")
  .agg(count(lit(1)))
The job reads Parquet data from HDFS, filters it, selects the target fields, groups by all of those fields, and then counts.
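For context, here is a self-contained sketch of that pipeline under stated assumptions: the SparkSession setup, the hdfs:///data/input path, and the final show() action are placeholders added for illustration, since the real job's configuration and sink are not part of the snippet above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

object CleaningJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("data-cleaning").getOrCreate()
    import spark.implicits._

    val inputDataPath = "hdfs:///data/input" // placeholder path

    // Read Parquet from HDFS, filter on type, keep the three key fields,
    // group by all of them, and count the rows in each group.
    val inputData = spark.read.parquet(inputDataPath)
    val viewMiddleTable = inputData
      .where($"type" === "0")
      .select($"field1", $"field2", $"field3")
      .groupBy($"field1", $"field2", $"field3")
      .agg(count(lit(1)).as("cnt"))

    viewMiddleTable.show(10) // placeholder action; the real job's sink is not shown in the question

    spark.stop()
  }
}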
When I check the Spark UI, I see the following:
Input: 81.2 GiB / Shuffle Write: 645.7 GiB
How can the shuffle write be so much larger than the data originally read? I would expect it to grow only slightly in this situation. Can anyone explain this? Thanks.