
I have a dataframe that looks like the following. I want to find the average of each array and put it in a new column. I can compute the average with a UDF, but I cannot put it in a column. It would be nice if you could help without a UDF; otherwise, any help with the current approach is welcome.

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType, ArrayType

data = [
    ("Smith", "[55, 65, 75]"),
    ("Anna", "[33, 44, 55]"),
    ("Williams", "[9.5, 4.5, 9.7]"),
]

schema = StructType([
    StructField('name', StringType(), True),
    StructField('some_value', StringType(), True)
])

df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)

+--------+---------------+
|name    |some_value     |
+--------+---------------+
|Smith   |[55, 65, 75]   |
|Anna    |[33, 44, 55]   |
|Williams|[9.5, 4.5, 9.7]|
+--------+---------------+

A solution I found (from Find mean of pyspark array&lt;double&gt;) is

array_mean = F.udf(lambda x: float(np.mean(x)), FloatType())

but it returns a dataframe, not a new column.

Any help is welcome. Thank you.

mck
Droid-Bird

2 Answers


You have a string column that merely looks like an array, not an actual array column, so you need to parse the string inside the UDF as well:

import json
import numpy as np
import pyspark.sql.functions as F

array_mean = F.udf(lambda x: float(np.mean(json.loads(x))), 'float')
df2 = df.withColumn('mean_value', array_mean('some_value'))

df2.show()
+--------+---------------+----------+
|    name|     some_value|mean_value|
+--------+---------------+----------+
|   Smith|   [55, 65, 75]|      65.0|
|    Anna|   [33, 44, 55]|      44.0|
|Williams|[9.5, 4.5, 9.7]|       7.9|
+--------+---------------+----------+
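The UDF body can be sanity-checked without a Spark session. Here is a plain-Python equivalent of what the lambda does per row (the name `array_mean_py` is mine, and `sum`/`len` stands in for `np.mean` to keep it dependency-free):

```python
import json

def array_mean_py(s):
    # parse the JSON-style string into a list of numbers,
    # then average them, mirroring the UDF's per-row logic
    vals = json.loads(s)
    return float(sum(vals) / len(vals))

print(array_mean_py("[55, 65, 75]"))     # 65.0
print(array_mean_py("[9.5, 4.5, 9.7]"))  # 7.9
```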
mck

Coming from Pandas and being a newbie to pyspark, I went the long way:

  1. Strip []

  2. split to turn into list

  3. explode

  4. mean

    df2 = (df
           .select(df.name, F.regexp_replace('some_value', '[\\]\\[]', '').alias('some_value'))
           .select(df.name, F.split('some_value', ',').alias('some_value'))
           .select(df.name, F.explode('some_value').alias('some_value')))
    df2 = (df2
           .withColumn('some_value', df2.some_value.cast('float'))
           .groupBy('name')
           .mean('some_value'))
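The same four steps (strip the brackets, split on commas, cast to float, average) can be mimicked per-string in plain Python to verify the expected means (the helper name is illustrative):

```python
def strip_split_mean(s):
    # 1. strip the surrounding [] brackets
    # 2. split the remainder on commas
    # 3. cast each piece to float
    # 4. average the values
    vals = [float(v) for v in s.strip('[]').split(',')]
    return sum(vals) / len(vals)

print(strip_split_mean("[33, 44, 55]"))   # 44.0
print(strip_split_mean("[9.5, 4.5, 9.7]"))  # 7.9
```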
wwnde