
I am using PySpark. The RDD has a column of floating point values, but some of the rows are missing values. The missing entries are just the empty string ''.

Now I want to write the mean and median of the column in place of the empty strings, but how do I compute the mean?

The rdd.mean() function won't work on a float column that contains empty strings.

import numpy as np

def replaceEmpty(x):
    # Map empty strings to NaN so the column can be treated as numeric
    if x == '':
        x = np.nan
    return x

def fillNA(x):
    # Intended to replace NaN entries with the column mean
    mu = np.nanmean(np.array(x))
    if x == np.nan:
        x = mu
    return x

data = data.map(lambda x: replaceEmpty(x))
data = data.map(lambda x: fillNA(x))

But this approach does not really work! Each map call sees only a single element, so np.nanmean never averages over the whole column, and the comparison x == np.nan is always False (np.isnan would be needed).
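
For reference, here is a minimal sketch of how the mean could be computed on the RDD itself by filtering the empty strings out first. This assumes data is an RDD whose elements are the column values as strings, as described above; the variable names are hypothetical:

# Keep only the non-empty entries and cast them to float
numeric = data.filter(lambda x: x != '').map(float)

# RDD.mean() works here because every remaining element is a float
mu = numeric.mean()

# Substitute the mean for the missing entries
data = data.map(lambda x: mu if x == '' else float(x))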


1 Answer


Finally solved it using: Fill Pyspark dataframe column null values with average value from same column

I used sqlContext instead of SparkContext. Previously, I was using:

data = sc.textFile('all_data_col5.txt')

I changed that to:

data = sqlContext.read.format('com.databricks.spark.csv').options(header=True, inferSchema=False).schema(df_schema).load('all_data_col5.csv')

sqlContext seems to have many more facilities for handling NA values.
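
For completeness, a minimal sketch of the fill-with-mean step from the linked answer, assuming the empty strings come through as nulls under the numeric schema, and that the target column is named 'col5' (a hypothetical name; the real one comes from df_schema):

from pyspark.sql import functions as F

# Compute the column mean; F.mean skips nulls automatically
mean_value = data.select(F.mean(F.col('col5'))).first()[0]

# Replace the nulls in that column with the computed mean
data = data.na.fill({'col5': mean_value})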
