
In a Coursera course on Spark + Scala there was a lesson about the fillna and replace functions.
I tried to reproduce it to see how it works in practice, but I have a problem creating a DataFrame with values that are meant to be replaced.
I tried using a JSON input file and a sequence of tuples. In both cases I received exceptions.
Could you please advise how to create a DataFrame that contains null / NaN / None (ideally all of them, which would be the best scenario for learning purposes)?

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object HowToCreateDfWithNullsOrNaNs
{
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("HowToCreateDfWithNullsOrNaNs")
    .getOrCreate()

  def main(args: Array[String]): Unit =
  {
    fromFile()
  }

  def fromFile(): Unit =
  {
    // input_file.json: { "name": "Tom", "surname": null, "age": 10}
    val rddFromJson: RDD[String] = spark.sparkContext.textFile("src/main/resources/input_file.json")
    import spark.implicits._
    /*
      Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
      Old column names (1): value
      New column names (3): name, surname, age
     */
    rddFromJson.toDF("name", "surname", "age")
  }

  def fromSeq(): Unit =
  {
    val tupleSeq: Seq[(String, Any, Int)] = Seq(("Tom", null, 10))
    val rdd = spark.sparkContext.parallelize(tupleSeq)
    /*
      Exception in thread "main" java.lang.ClassNotFoundException: scala.Any
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
     */
    import spark.implicits._
    rdd.toDF("name", "surname", "age")
  }
}
bridgemnc

1 Answer


Since you want to create DataFrames, I suggest not struggling with RDD-to-DataFrame conversions and instead creating, well, DataFrames directly.

For example, modifying your two examples:

// input_file.json: { "name": "Tom", "surname": null, "age": 10}
val df = spark.read.json("src/main/resources/input_file.json")
df.show()
+---+----+-------+
|age|name|surname|
+---+----+-------+
| 10| Tom|   null|
+---+----+-------+

val columns = Seq("name", "surname", "age")
val tupleSeq: Seq[(String, String, Int)] = Seq(("Tom", null, 10))
spark.createDataFrame(tupleSeq).toDF(columns: _*).show()
+----+-------+---+
|name|surname|age|
+----+-------+---+
| Tom|   null| 10|
+----+-------+---+

Note that in the second example the type shouldn't be Any, for the reasons given in this answer:

All fields / columns in a Dataset have to be of known, homogeneous type for which there is an implicit Encoder in the scope. There is simply no place for Any there
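
As a further sketch (not part of the original answer): with a fully typed tuple you can also express a missing value as an Option. The column type stays known (String), so the implicit Encoder resolves, and None is stored in the DataFrame as a SQL null. This assumes a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession

object OptionNullDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("OptionNullDemo")
      .getOrCreate()
    import spark.implicits._

    // Option.empty[String] keeps the tuple element typed as Option[String],
    // so the Encoder can be derived; None becomes null in the surname column.
    val df = Seq(("Tom", Option.empty[String], 10)).toDF("name", "surname", "age")
    df.show()

    spark.stop()
  }
}
```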

qaziqarta
  • Your answer is what I need. After using it I had some issues with the fill and replace functions: they did not work for me with null, and for None I received an exception, but they work for NaN as expected. Anyway, the question was about a DF with null/NaN/None, and your suggestion plus my fill/replace code works with NaN. Thank you :) – bridgemnc Aug 16 '22 at 19:29
  • @bridgemnc, BTW, regarding Any and nulls in DataFrames, I think these would also be of interest to you: https://stackoverflow.com/questions/52959980/seqint-int-with-nulls-implicitly-converted-to-dataframe, https://stackoverflow.com/questions/44725222/how-to-create-dataframe-with-nulls-using-todf – qaziqarta Aug 17 '22 at 13:44
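
For completeness, here is a hedged sketch of the na functions discussed in the comments, using a DataFrame built the same way as in the answer. Note that na.replace matches concrete (non-null) values, so nulls themselves are handled with na.fill, which may explain why replace appeared not to work on null:

```scala
import org.apache.spark.sql.SparkSession

object NaFunctionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("NaFunctionsDemo")
      .getOrCreate()

    val tupleSeq: Seq[(String, String, Int)] = Seq(("Tom", null, 10))
    val df = spark.createDataFrame(tupleSeq).toDF("name", "surname", "age")

    // na.fill replaces nulls in the named string column with the given value
    df.na.fill("unknown", Seq("surname")).show()

    // na.replace substitutes concrete values; it does not match nulls
    df.na.replace("name", Map("Tom" -> "Thomas")).show()

    spark.stop()
  }
}
```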