What is the status of schema evolution for arrays
of structs
(complex types) in spark?
I know that for either ORC or Parquet for regular simple types works rather fine (adding a new column) but I could not find any documentation so far for my desired case.
My use case is to have a structure similar to this one:
user_id,date,[{event_time, foo, bar, baz, tag1, tag2, ... future_tag_n}, ...]
And I want to be able to add new fields to the struct in the array.
Would a Map
(key-value) complex type instead cause any inefficiencies? There I would at least be sure that adding new fields (tags) would be flexible.
edit
case class BarFirst(baz:Int, foo:String)
case class BarSecond(baz:Int, foo:String, moreColumns:Int, oneMore:String)
case class BarSecondNullable(baz:Int, foo:String, moreColumns:Option[Int], oneMore:Option[String])
case class Foo(i:Int, date:String, events:Seq[BarFirst])
case class FooSecond(i:Int, date:String, events:Seq[BarSecond])
case class FooSecondNullable(i:Int, date:String, events:Seq[BarSecondNullable])
val dfInitial = Seq(Foo(1, "2019-01-01", Seq(BarFirst(1, "asdf")))).toDF
dfInitial.printSchema
dfInitial.show
root
|-- i: integer (nullable = false)
|-- date: string (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = false)
| | |-- foo: string (nullable = true)
scala> dfInitial.show
+---+----------+----------+
| i| date| events|
+---+----------+----------+
| 1|2019-01-01|[[1,asdf]]|
+---+----------+----------+
dfInitial.write.partitionBy("date").parquet("my_df.parquet")
tree my_df.parquet
my_df.parquet
├── _SUCCESS
└── date=2019-01-01
└── part-00000-fd77f730-6539-4b51-b680-b7dd5ffc04f4.c000.snappy.parquet
val evolved = Seq(FooSecond(2, "2019-01-02", Seq(BarSecond(1, "asdf", 11, "oneMore")))).toDF
evolved.printSchema
evolved.show
scala> evolved.printSchema
root
|-- i: integer (nullable = false)
|-- date: string (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = false)
| | |-- foo: string (nullable = true)
| | |-- moreColumns: integer (nullable = false)
| | |-- oneMore: string (nullable = true)
scala> evolved.show
+---+----------+--------------------+
| i| date| events|
+---+----------+--------------------+
| 1|2019-01-02|[[1,asdf,11,oneMo...|
+---+----------+--------------------+
import org.apache.spark.sql._
evolved.write.mode(SaveMode.Append).partitionBy("date").parquet("my_df.parquet")
my_df.parquet
├── _SUCCESS
├── date=2019-01-01
│ └── part-00000-fd77f730-6539-4b51-b680-b7dd5ffc04f4.c000.snappy.parquet
└── date=2019-01-02
└── part-00000-64e65d05-3f33-430e-af66-f1f82c23c155.c000.snappy.parquet
val df = spark.read.parquet("my_df.parquet")
df.printSchema
scala> df.printSchema
root
|-- i: integer (nullable = true)
|-- events: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- baz: integer (nullable = true)
| | |-- foo: string (nullable = true)
|-- date: date (nullable = true)
additional columns are missing! Why?
df.show
df.as[FooSecond].collect // AnalysisException: No such struct field moreColumns in baz, foo
df.as[FooSecondNullable].collect // AnalysisException: No such struct field moreColumns in baz, foo
This behaviour was evaluated for spark 2.2.3_2.11 and 2.4.2_2.12.