I have a DataFrame with a column whose type is a nested StructType. The struct is deeply nested and may contain other structs. I want to update a field at the lowest level of this column. I tried withField, but it has no effect if any of the parent structs is null. I would appreciate any help with this.
The example schema is:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("key", StringType)
  .add(
    "cells",
    ArrayType(
      new StructType()
        .add("family", StringType)
        .add("qualifier", StringType)
        .add("timestamp", LongType)
        .add("nestStruct", new StructType()
          .add("id1", LongType)
          .add("id2", StringType)
          .add("id3", new StructType()
            .add("id31", LongType)
            .add("id32", StringType))
        )
    )
  )
// One row whose single cell has a null nestStruct.
val data = Seq(
  Row(
    "1235321863",
    Array(
      Row("a", "b", 1L, null)
    )
  )
)
import org.apache.spark.sql.functions._
import spark.implicits._

val df_test = spark
  .createDataFrame(spark.sparkContext.parallelize(data), schema)

val result = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
    // This line has no effect if nestStruct is null.
    cell.withField("nestStruct.id3.id31", lit(40L))
  }))
result.show(false)
result.printSchema()
// The physical plan shows that if a parent struct is null, the expression just returns null.
result.explain()
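
One direction that might work, as an unverified sketch rather than a confirmed solution: replace each null parent struct with a placeholder struct of typed nulls before calling withField, so there is always a non-null struct to update. This assumes Spark 3.1 or later, and that `df_test` and `spark.implicits._` from the snippet above are in scope. The names `emptyId3`, `emptyNest` and `patched` are made up for illustration, and the placeholder field names and types have to be kept in sync with the schema by hand.

```scala
import org.apache.spark.sql.functions._

// Hypothetical placeholders: all-null structs with the same field names
// and types as id3 and nestStruct in the example schema.
val emptyId3 = struct(
  lit(null).cast("long").as("id31"),
  lit(null).cast("string").as("id32"))
val emptyNest = struct(
  lit(null).cast("long").as("id1"),
  lit(null).cast("string").as("id2"),
  emptyId3.as("id3"))

val patched = df_test.withColumn(
  "cell1",
  transform($"cells", cell => {
    // Fall back to a placeholder wherever a parent struct is null.
    val nest = coalesce(cell("nestStruct"), emptyNest)
    val id3 = coalesce(nest("id3"), emptyId3)
    // Rebuild the path from the leaf upwards, one level per withField call.
    cell.withField(
      "nestStruct",
      nest.withField("id3", id3.withField("id31", lit(40L))))
  }))
patched.printSchema()
patched.show(false)
```

Note that this changes the data: cells whose nestStruct was null end up with a non-null struct whose other fields (id1, id2, id32) are null, which may or may not be acceptable.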