
I'm very new to pyspark and I'm attempting to transition my pandas code to pyspark. One thing I'm having trouble with is aggregating after a groupby.

Here is the pandas code:

df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])

I saw this example on AnalyticsVidhya but I'm not sure how to apply that to the code above:

train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
|  55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+

Any help would be much appreciated.

EDIT:

Here's another attempt:

from pyspark.sql.functions import avg, variance
train1.groupby("CUSTOMER_NUMBER")\
    .agg(
        avg('repatha_trx').alias("repatha_trx_avg"), 
        variance('repatha_trx').alias("repatha_trx_Var")
    )\
    .show(100)

But that is just giving me an empty dataframe.

  • Your second attempt looks like it should work. Can you provide a minimal reproducible example of this issue? Please provide a small sample dataframe. Read more on [how to make good reproducible apache spark dataframe examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-dataframe-examples). – pault Apr 05 '18 at 14:57
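
A small reproducible dataframe for a question like this could be built with spark.createDataFrame; the sketch below uses made-up customer numbers and trx values purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data using the columns mentioned in the question
data = [
    ("C001", 2.0),
    ("C001", 4.0),
    ("C002", 1.0),
    ("C002", 3.0),
]
sample_df = spark.createDataFrame(data, ["CUSTOMER_NUMBER", "repatha_trx"])
sample_df.show()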

1 Answer


You can import the pyspark.sql.functions module to perform the aggregation.

# import the functions module
from pyspark.sql import functions as F

# aggregate: mean and variance of repatha_trx per customer
df_trx_m = train1.groupby('CUSTOMER_NUMBER').agg(
    F.avg(F.col('repatha_trx')).alias('repatha_trx_avg'),
    F.variance(F.col('repatha_trx')).alias('repatha_trx_var')
)
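
You can then call df_trx_m.show() to display the aggregated result.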

Note that pyspark.sql.functions.variance() is an alias for var_samp(), which returns the unbiased sample variance. If you want the population variance instead, use pyspark.sql.functions.var_pop().
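
For example, a quick sketch comparing the two on a small made-up column of values:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# made-up values purely to illustrate the difference
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,), (4.0,)], ["x"])

df.select(
    F.variance("x").alias("variance"),   # alias for var_samp: sample variance (n - 1 denominator)
    F.var_samp("x").alias("var_samp"),   # sample variance
    F.var_pop("x").alias("var_pop"),     # population variance (n denominator)
).show()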
