I would like to apply both summary statistics and customized stats functions to all columns, independently and in parallel.
from pyspark.sql.functions import rand, randn
df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10)).withColumn('normal', randn(seed=27))
I have tried to look for answers, such as here and here, but they seem to return one value per row. It seems that udf and withColumns are possible solutions, but I am not sure how to put them together to achieve something similar to this:
df.describe().show()
+-------+------------------+-------------------+--------------------+
|summary| id| uniform| normal|
+-------+------------------+-------------------+--------------------+
| count| 10| 10| 10|
| mean| 4.5| 0.5215336029384192|-0.01309370117407197|
| stddev|2.8722813232690143| 0.229328162820653| 0.5756058014772729|
| min| 0|0.19657711634539565| -0.7195024130068081|
| max| 9| 0.9970412477032209| 1.0900096472044518|
+-------+------------------+-------------------+--------------------+
Let's say we have a couple of dummy summary functions:
def mySummary1(df_1_col_only):
    # placeholder: mean of the column times its second-largest value
    return mean(df_1_col_only) * second_largest(df_1_col_only)

def mySummary2(df_1_col_only):
    # placeholder: a decile of the column divided by its median
    return tenth_decile(df_1_col_only) / median(df_1_col_only)
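For concreteness, here is a minimal sketch of how such summaries could be written as Spark aggregate Column expressions, assuming second_largest means the second-highest value and the decile/median are approximate quantiles (the helper names my_summary1/my_summary2 and the exact quantile fractions are just my placeholders):

from pyspark.sql import functions as F

def my_summary1(c):
    # c is a column name; mean(c) times its second-largest value
    # (second largest = element 1 of the descending-sorted collected values)
    return F.mean(c) * F.sort_array(F.collect_list(c), asc=False)[1]

def my_summary2(c):
    # approximate 0.1 quantile of c divided by its approximate median
    return (F.expr("percentile_approx({}, 0.1)".format(c))
            / F.expr("percentile_approx({}, 0.5)".format(c)))

A single call such as df.agg(my_summary1("uniform"), my_summary2("uniform")).show() would then evaluate both expressions in one job.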
With summary functions like these, applying them in parallel to all columns, something like
df.something_that_allow_spark_parallel_processing_all_columns(column_list, [mean, mySummary1, mySummary2])
would produce (preferably the following, or its transpose):
+--------+------------------+-------------------+--------------------+
|col_name|              mean|         mySummary1|          mySummary2|
+--------+------------------+-------------------+--------------------+
|      id|                10|                 10|                  10|
| uniform|               4.5| 0.5215336029384192|-0.01309370117407197|
|  normal|2.8722813232690143|  0.229328162820653|  0.5756058014772729|
+--------+------------------+-------------------+--------------------+
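I imagine something along these lines: a rough sketch (not a real single Spark method) that builds one wide agg so Spark computes every (column, summary) pair in a single parallel pass, then reshapes the tiny aggregated row into one row per column on the driver; my_summary1/my_summary2 are the placeholder helpers sketched above:

from pyspark.sql import functions as F

column_list = ["id", "uniform", "normal"]
summaries = {"mean": F.mean, "mySummary1": my_summary1, "mySummary2": my_summary2}

# one wide agg: every (column, summary) pair is computed in a single job
exprs = [fn(c).alias("{}__{}".format(c, name))
         for c in column_list for name, fn in summaries.items()]
wide = df.agg(*exprs).collect()[0].asDict()

# reshape on the driver: one row per column, one field per summary
rows = [[c] + [wide["{}__{}".format(c, name)] for name in summaries]
        for c in column_list]
sqlContext.createDataFrame(rows, ["col_name"] + list(summaries)).show()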
I would like to utilize Spark's parallelism both across the list of columns and inside each mySummaryXxx function.
Another question is: what should df_1_col_only be for efficiency? A one-column Spark DataFrame? Is there a way to avoid copying or duplicating the original DataFrame into a bunch of one-column DataFrames?