The code below I understand, and it was helpful. However, I would like to turn it into a generic approach, but I cannot actually get started, and I suspect it is not possible with the case statement. I am looking at another approach, but I am interested in whether a generic approach is also possible here.
import spark.implicits._
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, from_json}
// Creating case classes with the schema of your json objects. We're making
// these to make use of strongly typed Datasets. Notice that the MyChgClass has
// each field as an Option: this will enable us to choose between "chg" and
// "before"
case class MyChgClass(b: Option[String], c: Option[String], d: Option[String])
case class MyFullClass(k: Int, b: String, c: String, d: String)
case class MyEndClass(id: Int, after: MyFullClass)
// Creating schemas for the from_json function
val chgSchema = Encoders.product[MyChgClass].schema
val beforeSchema = Encoders.product[MyFullClass].schema
// Your dataframe from the example
val df = Seq(
  (1, """{"b": "new", "c": "new"}""", """{"k": 1, "b": "old", "c": "old", "d": "old"}"""),
  (2, """{"b": "new", "d": "new"}""", """{"k": 2, "b": "old", "c": "old", "d": "old"}""")
).toDF("id", "chg", "before")
// Parsing the json string into our case classes and finishing by creating a
// strongly typed dataset with the .as[] method
val parsedDf = df
  .withColumn("parsedChg", from_json(col("chg"), chgSchema))
  .withColumn("parsedBefore", from_json(col("before"), beforeSchema))
  .drop("chg")
  .drop("before")
  .as[(Int, MyChgClass, MyFullClass)]
// Mapping over our dataset with a lot of control of exactly what we want. Since
// the "chg" fields are options, we can use the getOrElse method to choose
// between either the "chg" field or the "before" field
val output = parsedDf.map {
  case (id, chg, before) =>
    MyEndClass(id, MyFullClass(
      before.k,
      chg.b.getOrElse(before.b),
      chg.c.getOrElse(before.c),
      chg.d.getOrElse(before.d)
    ))
}
output.show(false)
parsedDf.printSchema()
We have many such situations, but with differing payloads. I can get the fields of the case class, but I cannot see the forest for the trees as to how to make this generic, e.g. with a [T] type-parameter approach, for the snippet below. I am wondering whether this can in fact be done. I can get a List of attributes, and am wondering if something like attrList.map(x => ...) with substitution can be used in place of the hard-coded chg.b etc.? A rough sketch of the direction I am imagining follows the snippet.
val output = parsedDf.map {
  case (id, chg, before) =>
    MyEndClass(id, MyFullClass(
      before.k,
      chg.b.getOrElse(before.b),
      chg.c.getOrElse(before.c),
      chg.d.getOrElse(before.d)
    ))
}
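For illustration, this is a rough, untested sketch of the column-level direction I am imagining, building on parsedDf from the code above. The names mergedCols and genericOutput are mine, and the switch from getOrElse to coalesce is an assumption on my part, not something from the working code:

import org.apache.spark.sql.functions.{coalesce, struct}

// Field lists derived from the case classes rather than hard-coded
val chgFields  = Encoders.product[MyChgClass].schema.fieldNames.toSet
val fullFields = Encoders.product[MyFullClass].schema.fieldNames.toSeq

// For each field of the full payload: take the "chg" value if that field
// exists there and is non-null, otherwise fall back to the "before" value
val mergedCols = fullFields.map { f =>
  if (chgFields.contains(f))
    coalesce(col(s"parsedChg.$f"), col(s"parsedBefore.$f")).as(f)
  else
    col(s"parsedBefore.$f").as(f)
}

val genericOutput = parsedDf
  .select(col("id"), struct(mergedCols: _*).as("after"))
  .as[MyEndClass]

genericOutput.show(false)

The idea would be that only the case classes change per payload, while the field-by-field fallback logic is derived from their schemas. I am unsure whether this, or a typed [T] approach over the Dataset, is the better route.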