
I want to cast the schema of a dataframe to change the type of some columns using Spark and Scala.

Specifically, I am trying to use the as[U] function, whose description reads: "Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of U"

In principle this is exactly what I want, but I cannot get it to work.

Here is a simple example taken from https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala



    // definition of data
    val data = Seq(("a", 1), ("b", 2)).toDF("a", "b")

As expected the schema of data is:


    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)
    

I would like to cast the column "b" to Double. So I try the following:



    import session.implicits._;

    println(" --------------------------- Casting using (String Double)")

    val data_TupleCast=data.as[(String, Double)]
    data_TupleCast.show()
    data_TupleCast.printSchema()

    println(" --------------------------- Casting using ClassData_Double")

    case class ClassData_Double(a: String, b: Double)

    val data_ClassCast= data.as[ClassData_Double]
    data_ClassCast.show()
    data_ClassCast.printSchema()

As I understand the definition of as[U], the new Datasets should have the following schema:


    root
     |-- a: string (nullable = true)
     |-- b: double (nullable = false)

But the output is


     --------------------------- Casting using (String Double)
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

     --------------------------- Casting using ClassData_Double
    +---+---+
    |  a|  b|
    +---+---+
    |  a|  1|
    |  b|  2|
    +---+---+

    root
     |-- a: string (nullable = true)
     |-- b: integer (nullable = false)

which shows that column "b" has not been cast to double.

Any hints on what I am doing wrong?

BTW: I am aware of the previous post "How to change column types in Spark SQL's DataFrame?". I know I can change the type of columns one at a time, but I am looking for a more general solution that changes the schema of the whole DataFrame in one shot (and I am trying to understand Spark in the process).

Massimo Paolucci
  • I don't think you can - the `as[U]` API does not change the actual types, it just provides a typed API for handling the dataset; `U` must match the actual types, and changing the actual types can only be done via transformations such as `Column.cast` as explained in the question you linked. – Tzach Zohar Oct 25 '16 at 06:28
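
To illustrate the comment: below is a minimal sketch of casting every column to the types in a target schema in a single select. The helper name castTo is made up for illustration and is not part of Spark's API; it is essentially the Column.cast approach from the linked question applied to all columns at once, assuming the column names in the schema match those in the DataFrame.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: cast each column of df to the corresponding
    // type in the target schema, all in one select.
    def castTo(df: DataFrame, schema: StructType): DataFrame =
      df.select(schema.fields.map(f => df(f.name).cast(f.dataType)): _*)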

1 Answer


Well, since functions are chained and Spark does lazy evaluation, it actually does change the schema of the whole data in one shot, even if you write it as changing one column at a time, like this:

    import org.apache.spark.sql.types.{DoubleType, StringType}
    import spark.implicits._

    df.withColumn("x", 'x.cast(DoubleType)).withColumn("y", 'y.cast(StringType))...
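
Applied to the example data from the question, a minimal sketch (assuming the SparkSession is in scope as spark, as in the snippet above) would be:

    import org.apache.spark.sql.types.DoubleType
    import spark.implicits._

    // Cast the underlying column first; as[U] then matches the actual types.
    val data_Double = data
      .withColumn("b", $"b".cast(DoubleType))
      .as[(String, Double)]

    data_Double.printSchema()
    // root
    //  |-- a: string
    //  |-- b: double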

As an alternative, I'm thinking you could use map to do your cast in one go, like:

    df.map { t => (t._1, t._2.asInstanceOf[Double], t._3.asInstanceOf[...], ...) }
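
For the two-column example from the question, a minimal runnable version of that idea (my own sketch, going through a typed Dataset so the tuple fields have known types) might look like:

    import spark.implicits._

    // Read the rows as (String, Int) first, then convert the second field.
    val data_Mapped = data.as[(String, Int)].map { case (a, b) => (a, b.toDouble) }

    data_Mapped.printSchema()
    // root
    //  |-- _1: string
    //  |-- _2: double

Note that map renames the columns to _1 and _2; they can be restored with toDF("a", "b").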
Glennie Helles Sindholt
  • Thanks for the suggestions; they are very much along the lines of other solutions that I found on SO, and in line with the solution I am going to adopt. Still, it would be very convenient to use as[U]. I am also wondering whether I should do a trick with Encoders, but then I do not understand why, if one uses (String, Int), Encoders are not needed. – Massimo Paolucci Oct 25 '16 at 19:52
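
On the Encoders point: spark.implicits._ already derives encoders both for tuples of common types and for case classes (any Product), which is why as[(String, Int)] needs nothing extra. Below is a sketch of how the case class's encoder could supply the target types for the cast; this is my own combination, assuming ClassData_Double is defined at top level so a TypeTag is available.

    import org.apache.spark.sql.Encoders
    import spark.implicits._

    // The encoder's schema carries the desired column types (b as double).
    val target = Encoders.product[ClassData_Double].schema

    // Cast every column to its target type, then as[U] lines up with the data.
    val data_Casted = data
      .select(target.fields.map(f => data(f.name).cast(f.dataType)): _*)
      .as[ClassData_Double]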