I am new to PySpark and working on my first Spark project, where I am running into two issues.
a) I am not able to reference a column using
df["col1"].show()
***TypeError: 'Column' object is not callable***
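If I am reading the docs correctly, show() is defined on DataFrame rather than on Column, so a sketch of what I expected to work (untested beyond the line above) would be:

# show() is a DataFrame method, not a Column method,
# so the column has to go through select() first
df.select("col1").show()
df.select(df["col1"]).show()   # same thing, using bracket notation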
b) I am not able to replace null values in my Spark DataFrame with an aggregated value such as the mean.
Code:
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import *
from pyspark.sql import Row, HiveContext, SQLContext, Column
from pyspark.sql.types import *

# context setup
sc = SparkContext(conf=SparkConf())
hive_context = HiveContext(sc)

df = hive_context.table("db_new.temp_table")
# intended: fill the nulls in col1 with the mean of col1
df.select("col1").fillna(df.select("col1").mean())
***AttributeError: 'DataFrame' object has no attribute 'mean'***
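As far as I understand the API, fillna() wants a literal value rather than a DataFrame, so the mean has to be collected into a plain Python value first. A rough (untested) sketch of what I am trying to do:

from pyspark.sql.functions import mean

# collect the aggregate as a plain Python value:
# first() returns a Row, [0] pulls the mean out of it
mean_val = df.select(mean(df["col1"])).first()[0]

# the dict form of fillna() limits the fill to col1
df_filled = df.fillna({"col1": mean_val})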
Any help is greatly appreciated!
Update:
I tried the code snippet below, but it returns another error.
df.withColumn("new_Col", when("ColA".isNull,df.select(mean("ColA"))
.first()(0).asInstanceOf[Double])
.otherwise("ColA"))
***AttributeError: 'str' object has no attribute 'isNull'***
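The snippet above appears to be Scala syntax (.asInstanceOf[Double], .first()(0)), which is presumably why Python complains about the bare string. My best guess at a PySpark translation (untested) is:

from pyspark.sql.functions import col, mean, when

# pull the aggregate out as a Python value first
mean_val = df.select(mean(col("ColA"))).first()[0]

# col("ColA") gives a Column, which does have isNull();
# a bare "ColA" inside when()/otherwise() would be a literal string
df = df.withColumn(
    "new_Col",
    when(col("ColA").isNull(), mean_val).otherwise(col("ColA")),
)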