
I'm very new to pyspark and I'm attempting to transition my pandas code to pyspark. One thing I'm having trouble with is aggregating after a groupby.

Here is the pandas code:

df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])

I saw this example on AnalyticsVidhya but I'm not sure how to apply that to the code above:

train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
|  55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+

Any help would be much appreciated.

EDIT:

Here's another attempt:

from pyspark.sql.functions import avg, variance
train1.groupby("CUSTOMER_NUMBER")\
    .agg(
        avg('repatha_trx').alias("repatha_trx_avg"), 
        variance('repatha_trx').alias("repatha_trx_Var")
    )\
    .show(100)

But that is just giving me an empty dataframe.

  • Your second attempt looks like it should work. Can you provide a minimal reproducible example of this issue? Please provide a small sample dataframe. Read more on [how to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Apr 05 '18 at 14:57
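
A small reproducible dataframe for a question like this could be built with spark.createDataFrame; the sketch below uses made-up customer numbers and trx values purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data using the columns mentioned in the question
data = [
    ("C001", 2.0),
    ("C001", 4.0),
    ("C002", 1.0),
    ("C002", 3.0),
]
sample_df = spark.createDataFrame(data, ["CUSTOMER_NUMBER", "repatha_trx"])
sample_df.show()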

1 Answer


You can import the pyspark.sql.functions module to perform the aggregation.

# import the functions module
from pyspark.sql import functions as F

# aggregate: mean and variance of repatha_trx per customer
df_trx_m = train1.groupby('CUSTOMER_NUMBER').agg(
    F.avg(F.col('repatha_trx')).alias('repatha_trx_avg'),
    F.variance(F.col('repatha_trx')).alias('repatha_trx_var')
)
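
You can then call df_trx_m.show() to display the aggregated result.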

Note that pyspark.sql.functions.variance() is an alias for var_samp(), which returns the unbiased sample variance. If you want the population variance instead, use pyspark.sql.functions.var_pop().
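
For example, a quick sketch comparing the two on a small made-up column of values:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# made-up values purely to illustrate the difference
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["x"])

df.select(
    F.variance("x").alias("variance"),   # alias for var_samp: sample variance (n - 1 denominator)
    F.var_samp("x").alias("var_samp"),   # sample variance
    F.var_pop("x").alias("var_pop"),     # population variance (n denominator)
).show()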
