Replace NA with mean Pyspark with help of window function

Question

I want to replace NA values with the mean and the median, computed per group over multiple columns, using a window function in PySpark.

Sample Input:

Required Output for mean:

Required output for median: the output would be the same as above, but with the nulls replaced by the median instead. I can't find a median function in pyspark.sql.functions.

please provide sample data, or look at https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples to help you in asking comprehensive questions. — murtihash, Feb 25 '20 at 05:57

murtihash · Accepted Answer · 2020-02-25T06:40:16.867


Creating sample dataframe:

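from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Assumes an active SparkSession named `spark` (as in the pyspark shell).
# Named `data` rather than `list` to avoid shadowing the Python built-in.
data = [[1, 5, 4],
        [1, 5, None],
        [1, 5, 4],
        [1, 5, 4],
        [2, 5, 1],
        [2, 5, 2],
        [2, 5, None],
        [2, 5, None]]
df = spark.createDataFrame(data, ['I_id', 'p_id', 'xyz'])
df.show()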

+----+----+----+
|I_id|p_id| xyz|
+----+----+----+
|   1|   5|   4|
|   1|   5|null|
|   1|   5|   4|
|   1|   5|   4|
|   2|   5|   1|
|   2|   5|   2|
|   2|   5|null|
|   2|   5|null|
+----+----+----+
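
Before turning to Spark, it may help to spell out what a per-group fill should produce on this data. The sketch below is plain Python (no Spark) and uses the standard `statistics` module. `fill_by_group` is a hypothetical helper written only for this illustration, and the rows are the same sample rows as above.

```python
from collections import defaultdict
from statistics import mean, median

# Same sample rows as the dataframe above: (I_id, p_id, xyz)
rows = [(1, 5, 4), (1, 5, None), (1, 5, 4), (1, 5, 4),
        (2, 5, 1), (2, 5, 2), (2, 5, None), (2, 5, None)]

def fill_by_group(rows, stat):
    """Replace a missing xyz with stat() of the non-null xyz values
    that share the same (I_id, p_id) pair."""
    groups = defaultdict(list)
    for i_id, p_id, xyz in rows:
        if xyz is not None:
            groups[(i_id, p_id)].append(xyz)
    # One fill value per group, computed from the non-null values only
    fills = {key: stat(values) for key, values in groups.items()}
    return [(i, p, x if x is not None else fills.get((i, p)))
            for i, p, x in rows]

mean_filled = fill_by_group(rows, mean)
median_filled = fill_by_group(rows, median)

print(mean_filled[6])    # (2, 5, 1.5)
print(median_filled[6])  # (2, 5, 1.5)
```

On this sample the mean and the exact median agree in both groups, so the two outputs only start to differ on more skewed data.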

Creating Window and filling nulls:

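# Partition by both key columns so each (I_id, p_id) group gets its own mean
w = Window.partitionBy("I_id", "p_id")

# F.mean ignores nulls; the result is a double, so xyz becomes a double column
df.withColumn("mean", F.mean("xyz").over(w))\
  .withColumn("xyz", F.when(F.col("xyz").isNull(), F.col("mean")).otherwise(F.col("xyz")))\
  .drop("mean").show()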

+----+----+---+
|I_id|p_id|xyz|
+----+----+---+
|   1|   5|4.0|
|   1|   5|4.0|
|   1|   5|4.0|
|   1|   5|4.0|
|   2|   5|1.0|
|   2|   5|2.0|
|   2|   5|1.5|
|   2|   5|1.5|
+----+----+---+

edited Feb 25 '20 at 06:40

answered Feb 25 '20 at 06:18

murtihash


One correction to the above code: the partition columns should be I_id and p_id. – Vigneshwar Thiyagarajan Feb 25 '20 at 06:38
Can you give a similar example for NA imputation with the median? – Vigneshwar Thiyagarajan Feb 26 '20 at 06:45
@VigneshwarThiyagarajan I can't answer here because the question is closed. Open a new question. – murtihash Feb 26 '20 at 06:56
I posted a new question for NA replacement with the median. – Vigneshwar Thiyagarajan Feb 26 '20 at 07:11
I'll check it out. – murtihash Feb 26 '20 at 07:17
