
I have a CSV file with two columns, name and value. Both are stored as strings.

dummy.csv:
Jordan  20|
Mike    NA|
James   30|
Steve   NA|
Stella  20|
David   NA

Schema:

root
 name: string (nullable = true)
 value: string (nullable = true)

I'm trying to replace the "NA" values with the average of that column. I'm able to calculate the average; however, I can't work out how to replace the "NA" values with the mean.

from pyspark.sql.functions import col, mean, round

dummmyCol=['value']
dummydf.select([round(mean(col(c)),2).alias(c) for c in dummmyCol]).show()

+-----+
|value|
+-----+
|23.33|
+-----+

Below is my attempt at replacing the NA values. I know it is flawed. Any help would be greatly appreciated. Thanks.

dummydf.select([when(col(c1)=='NA',dummydf.select(round(mean(col(c1)),2))).alias(c1) for c1 in dummmyCol])

Expected output should be:

Jordan  20|
Mike    23.3|
James   30|
Steve   23.3|
Stella  20|
David   23.3
  • you can see this post: https://stackoverflow.com/questions/40057563/replace-missing-values-with-mean-spark-dataframe – firsni Sep 26 '19 at 11:56
  • I did go through this. But I'm unwilling to use ml features for this question since I have no idea what imputer does. Are there any chances I can get this done with my approach in python? – Karthick Rajasekaran Sep 26 '19 at 14:17
