I have a dataframe that looks like the following. I want to find the average of each some_value list and put it in a new column. I can compute the average using a UDF, but I cannot get it into a column. It would be nice if you could help without a UDF; otherwise, any help with my current approach is welcome.
from pyspark.sql.types import StructType, StructField, StringType

data = [
    ("Smith", "[55, 65, 75]"),
    ("Anna", "[33, 44, 55]"),
    ("Williams", "[9.5, 4.5, 9.7]"),
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('some_value', StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)
+--------+---------------+
|name |some_value |
+--------+---------------+
|Smith |[55, 65, 75] |
|Anna |[33, 44, 55] |
|Williams|[9.5, 4.5, 9.7]|
+--------+---------------+
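For what it's worth, the output I am hoping for would look something like this, with new_column holding the mean of each some_value list:
+--------+---------------+----------+
|name    |some_value     |new_column|
+--------+---------------+----------+
|Smith   |[55, 65, 75]   |65.0      |
|Anna    |[33, 44, 55]   |44.0      |
|Williams|[9.5, 4.5, 9.7]|7.9       |
+--------+---------------+----------+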
One solution I found (in Find mean of pyspark array<double>) uses this UDF:
array_mean = F.udf(lambda x: float(np.mean(x)), FloatType())
but that answer returns a new dataframe rather than adding a new column.
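For context, here is roughly what I have been trying. The from_json parsing step and the column names parsed / new_column are my own guesses at handling the fact that some_value is stored as a string rather than a real array:

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType, FloatType

# same UDF as above, repeated here with its imports
array_mean = F.udf(lambda x: float(np.mean(x)), FloatType())

df_with_mean = (
    df
    # parse the string into an array<double> first (my guess at the needed step)
    .withColumn("parsed", F.from_json("some_value", ArrayType(DoubleType())))
    # then apply the UDF so the mean lands in a new column
    .withColumn("new_column", array_mean("parsed"))
)
df_with_mean.show(truncate=False)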
Any help is welcome. Thank you.