In a Coursera course on Spark + Scala there was a lesson on the fillna and replace functions.
I tried to reproduce it to check how it works in practice, but I have a problem creating a DataFrame with values that are meant to be replaced.
I tried it with a JSON input file and with a sequence of tuples; in both cases I got an exception.
Could you please advise what I have to do to create a DataFrame that contains null / NaN / None values (ideally all of them, which would be best for learning purposes)?
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object HowToCreateDfWithNullsOrNaNs
{
  val spark: SparkSession = SparkSession.builder()
    .appName("HowToCreateDfWithNullsOrNaNs")
    .master("local[*]")
    .getOrCreate()

  def main(args: Array[String]): Unit =
  {
    fromFile()
  }

  def fromFile(): Unit =
  {
    // input_file.json: { "name": "Tom", "surname": null, "age": 10 }
    val rddFromJson: RDD[String] = spark.sparkContext.textFile("src/main/resources/input_file.json")
    import spark.implicits._
    /*
    Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
    Old column names (1): value
    New column names (3): name, surname, age
    */
    rddFromJson.toDF("name", "surname", "age")
  }

  def fromSeq(): Unit =
  {
    val tupleSeq: Seq[(String, Any, Int)] = Seq(("Tom", null, 10))
    val rdd = spark.sparkContext.parallelize(tupleSeq)
    /*
    Exception in thread "main" java.lang.ClassNotFoundException: scala.Any
      at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
      at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    */
    import spark.implicits._
    rdd.toDF("name", "surname", "age")
  }
}
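For comparison, here is a minimal sketch (assuming Spark 2.x+ with a local SparkSession) of what I expect a working version to look like: declaring the nullable column as Option[String] instead of Any lets Spark derive an encoder, so None should become a SQL NULL, and Double.NaN can be stored directly in a Double column.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DfWithNullsSketch {
  // Local session just for this sketch
  val spark: SparkSession = SparkSession.builder()
    .appName("DfWithNullsSketch")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // Option[String] (not Any) gives Spark a derivable encoder:
  // None -> SQL NULL in "surname"; Double.NaN stays NaN in "score".
  val df: DataFrame = Seq(
    ("Tom",  None,          10, Double.NaN),
    ("Anna", Some("Smith"), 12, 3.5)
  ).toDF("name", "surname", "age", "score")
}
```

With such a frame, df.na.fill(Map("surname" -> "unknown")) replaces the NULL, and df.na.fill(0.0, Seq("score")) replaces the NaN (Spark's na.fill treats NaN like null for numeric columns). For the JSON route, using spark.read.json("src/main/resources/input_file.json") instead of sparkContext.textFile(...).toDF(...) parses each record into columns and maps the JSON null to a SQL NULL.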