In PySpark, I have a variable-length array of doubles for which I would like to find the mean. However, the avg function requires a single numeric type, not an array.
Is there a way to find the average of an array without exploding it? I have several different array columns and I'd like to be able to do something like the following:
df.select(col("Segment.Points.trajectory_points.longitude"))
DataFrame[longitude: array]
df.select(avg(col("Segment.Points.trajectory_points.longitude"))).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'avg(Segment.Points.trajectory_points.longitude)' due to data type mismatch: function average requires numeric types, not ArrayType(DoubleType,true);;
If I have 3 unique records with the following arrays, I'd like the mean of each array as the output, i.e. 3 mean longitude values.
Input:
[Row(longitude=[-80.9, -82.9]),
Row(longitude=[-82.92, -82.93, -82.94, -82.96, -82.92, -82.92]),
Row(longitude=[-82.93, -82.93])]
Output:
-81.9,
-82.931,
-82.93
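For reference, a minimal snippet to reproduce this input (assuming an existing SparkSession named spark; the sample_df name and values are purely illustrative) would be something like:
from pyspark.sql import Row

# Illustrative sample data matching the input above; `spark` is assumed to be
# an existing SparkSession.
sample_df = spark.createDataFrame([
    Row(longitude=[-80.9, -82.9]),
    Row(longitude=[-82.92, -82.93, -82.94, -82.96, -82.92, -82.92]),
    Row(longitude=[-82.93, -82.93]),
])
sample_df.show(truncate=False)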
I am using Spark version 2.1.3.
Explode Solution:
So I've got this working by exploding, but I was hoping to avoid that step. Here's what I did:
import pyspark.sql.functions as F
from pyspark.sql.functions import col

# Explode each array element into its own row, keeping its position in the array
longitude_exp = df.select(
    col("ID"),
    F.posexplode("Segment.Points.trajectory_points.longitude").alias("pos", "longitude")
)

# Group back up by ID and average the exploded values
longitude_reduced = longitude_exp.groupBy("ID").agg(F.avg("longitude"))
This successfully computes the mean. However, since I'll be doing this for several columns, I'd have to explode the same DataFrame several different times. I'll keep working through it to find a cleaner way to do this.
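One alternative I'm considering (just a sketch, not tested at scale) is a plain Python UDF that averages each array in place. That keeps one row per record and can be reused on every array column without exploding; the array_mean helper and the mean_longitude alias are names I made up. It should work on 2.1.3, since the built-in higher-order array functions only arrive in later Spark versions:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DoubleType

# Average a list of doubles in plain Python; return None for empty/missing
# arrays so nulls propagate instead of raising an error.
def _mean(values):
    return float(sum(values)) / len(values) if values else None

array_mean = udf(_mean, DoubleType())

df.select(
    col("ID"),
    array_mean(col("Segment.Points.trajectory_points.longitude")).alias("mean_longitude"),
).show()
The trade-off is that a Python UDF serializes every row out to the Python workers, so it's slower than a native aggregate, but it avoids exploding and re-grouping the same DataFrame for each array column.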