I'm very new to PySpark and I'm trying to translate my pandas code to PySpark. One thing I'm having trouble with is aggregating my groupby.
Here is the pandas code:
df_trx_m = train1.groupby('CUSTOMER_NUMBER')['trx'].agg(['mean', 'var'])
I saw this example on AnalyticsVidhya but I'm not sure how to apply that to the code above:
train.groupby('Age').agg({'Purchase': 'mean'}).show()
Output:
+-----+-----------------+
| Age| avg(Purchase)|
+-----+-----------------+
|51-55|9534.808030960236|
|46-50|9208.625697468327|
| 0-17|8933.464640444974|
|36-45|9331.350694917874|
|26-35|9252.690632869888|
| 55+|9336.280459449405|
|18-25|9169.663606261289|
+-----+-----------------+
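If I follow that pattern with my own columns (CUSTOMER_NUMBER and trx from the pandas code above), I can get the mean, but the dict form of agg only seems to accept one function per column, so I don't see how to get the variance as well:
train1.groupby('CUSTOMER_NUMBER').agg({'trx': 'mean'}).show()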
Any help would be much appreciated.
EDIT:
Here's another attempt:
from pyspark.sql.functions import avg, variance
train1.groupby("CUSTOMER_NUMBER")\
    .agg(
        avg('repatha_trx').alias("repatha_trx_avg"),
        variance('repatha_trx').alias("repatha_trx_Var")
    )\
    .show(100)
But that just gives me an empty DataFrame.
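For what it's worth, before the aggregation I also plan to sanity-check that train1 itself isn't empty, since that would explain the empty result (a minimal check using the columns above):
train1.count()
train1.select('CUSTOMER_NUMBER', 'repatha_trx').show(5)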