I am using Spark SQL DataFrames to perform a groupBy and then compute the count, mean, and approximate percentiles (5th, median, 95th) of an error column for each group. The input data is about 1 terabyte.
import org.apache.spark.sql.functions._
import spark.implicits._  // assumes a SparkSession named spark; needed for the $"..." column syntax

val df_result = df.filter($"DayOfWeek" <= 5)
  .groupBy("id")
  .agg(
    count("Error").as("Count"),
    avg("Error").as("MeanError"),
    callUDF("percentile_approx", col("Error"), lit(0.05)).as("5thError"),
    callUDF("percentile_approx", col("Error"), lit(0.5)).as("MedianError"),
    callUDF("percentile_approx", col("Error"), lit(0.95)).as("95thError"))
  .filter($"Count" > 1000)
df_result.orderBy(asc("MeanError")).limit(5000)
.write.format("csv").option("header", "true").save("/user/foo.bar/result.csv")
When I run this query, the job gets stuck and never completes. How should I go about debugging the problem? Could key skew (a few ids owning a disproportionate share of the rows) be causing the groupBy() to hang?
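For reference, here is a minimal sketch of the kind of skew check I had in mind, using the same df and columns as above (I am not sure whether this is the right diagnostic):

// Sketch of a key-skew check: count rows per id after the same weekday
// filter and inspect the largest groups. A few ids holding a huge share
// of the rows would point to skew in the groupBy.
df.filter($"DayOfWeek" <= 5)
  .groupBy("id")
  .count()
  .orderBy(desc("count"))
  .show(20, false)  // show the 20 heaviest ids, untruncated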