I am using PySpark. My RDD has a column of floating-point values, and some of the values are missing. Each missing value is stored as an empty string ''.
I want to replace the empty strings with the mean or median of the column, but how do I compute the mean? The rdd.mean() function won't work on a column of floats that also contains empty strings. This is what I tried:
import numpy as np

def replaceEmpty(x):
    if x=='':
        x = np.nan
    return x

def fillNA(x):
    mu = np.nanmean(np.array(x))
    if x==np.nan:
        x = mu
    return x

data = data.map(lambda x: replaceEmpty(x))
data = data.map(lambda x: fillNA(x))
But this approach does not work: the empty entries are not replaced by the mean.
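Two things in the attempt above look suspicious. First, `x == np.nan` is always False, because NaN never compares equal to anything, including itself. Second, `map` calls `fillNA` on one element at a time, so `np.nanmean` only ever sees a single value rather than the whole column. One possible direction is sketched below. It is only a sketch, and it assumes that `data` is an RDD whose elements are the raw string values of that one column. The helper names `to_float_or_none` and `fill_with_mean` are made up for illustration.

```python
def to_float_or_none(x):
    """Parse one raw value; the empty string marks a missing entry (assumption)."""
    return None if x == '' else float(x)


def fill_with_mean(rdd):
    """Return a new RDD in which missing entries are replaced by the column mean."""
    parsed = rdd.map(to_float_or_none)
    # Compute the mean once, over the present values only.
    mu = parsed.filter(lambda v: v is not None).mean()
    # mu is a plain Python float, so the closure below can capture it safely.
    return parsed.map(lambda v: mu if v is None else v)
```

With something like `data = sc.parallelize(['1.0', '', '3.0'])`, calling `fill_with_mean(data).collect()` should replace the empty entry with 2.0. RDDs have no built-in median, so filling with the median would need a separate computation in place of `.mean()`.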