Scala 2.10 here, using Spark 1.6.2. I have a similar (but not the same) question as this one; however, the accepted answer there is not an SSCCE and assumes a certain amount of "upfront knowledge" about Spark, so I can't reproduce it or make sense of it. More importantly, that question is limited to adding a new column to an existing dataframe, whereas I need to add a column as well as a value for all existing rows in the dataframe.
So I want to add a column to an existing Spark DataFrame, and then apply an initial ("default") value for that new column to all rows. Here's my starting point:
val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json)) // single-element RDD[String]
val jsonDF = sqlContext.read.json(rdd)        // schema is inferred from the JSON
jsonDF.show()
When I run that, I get the following as output (via .show()):
+----+--------+
| x| y|
+----+--------+
|true|not true|
+----+--------+
Now I want to add a new field to jsonDF, after it's created and without modifying the json string, such that the resulting DataFrame would look like this:
+----+--------+----+
| x| y| z|
+----+--------+----+
|true|not true| red|
+----+--------+----+
Meaning, I want to add a new "z" column to the DF, of type StringType, and then default all rows to contain a z-value of "red".
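For reference, jsonDF.printSchema() on the above gives:

root
 |-- x: boolean (nullable = true)
 |-- y: string (nullable = true)

and what I'm after is that same schema plus a `|-- z: string` line (I don't much care what nullability the new column ends up with).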
From that other question I have pieced the following pseudo-code together:
// same setup as above
val json: String = """{ "x": true, "y": "not true" }"""
val rdd = sparkContext.parallelize(Seq(json))
val jsonDF = sqlContext.read.json(rdd)
//jsonDF.show()

// this is the line that fails -- jsonDF has no column named "col"
val newDF = jsonDF.withColumn("z", jsonDF("col") + 1)
newDF.show()
But when I run this, I get a runtime AnalysisException (not a compiler error) on that .withColumn(...) call:
org.apache.spark.sql.AnalysisException: Cannot resolve column name "col" among (x, y);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:664)
at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:652)
I also don't see any API methods that would allow me to set "red" as the default value. Any ideas as to where I'm going awry?
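Update: after posting this, I stumbled across lit(...) in org.apache.spark.sql.functions, which looks like it might be exactly what I'm missing. I haven't been able to verify this end-to-end on 1.6.2 yet, so treat it as a guess rather than a confirmed answer:

import org.apache.spark.sql.functions.lit

// lit("red") wraps the literal in a Column expression,
// so every existing row should get the same constant value
val newDF = jsonDF.withColumn("z", lit("red"))
newDF.show() // should hopefully print the table above, with z = "red"

Is that the idiomatic way to do this, or is there a better approach?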