0

I'm trying to append two dataframes together with a near identical schema. Unfortunately there exists an array field containing a StructType which differs slightly between the two. One of the dataframes' StructType contains an extra StructField.

Df1

(Invoice,StringType),
(LineItems,
    ArrayType(
        StructType(
            StructField(ItemID,StringType,true),
            StructField(Name,StringType,true)
            )
    ,true)
)

Df2

(Invoice,StringType),
(LineItems,
    ArrayType(
        StructType(
            StructField(ItemID,StringType,true),
            StructField(Name,StringType,true),
            StructField(DiscountRate,DoubleType,true)
            )
    ,true)
)

Using the mergeSchema option still results in an error. How can I merge these two dataframes ?

I'm trying to reach a generalised solution, all the solutions I've come across so far have been case specific.

  • 1
    Does this answer your question? [Spark merge dataframe with mismatching schemas without extra disk IO](https://stackoverflow.com/questions/39869084/spark-merge-dataframe-with-mismatching-schemas-without-extra-disk-io) – M. Yousfi May 12 '23 at 09:10

1 Answers1

1

Given that you have df:

+-------+------------------------------------------------------------+
|Invoice|LineItems                                                   |
+-------+------------------------------------------------------------+
|INV0001|[{001, Item 1, 0.5}, {002, Item 2, 0.1}, {003, Item 3, 0.3}]|
|INV0002|[{004, Item 4, 0.6}, {005, Item 5, 0.2}]                    |
+-------+------------------------------------------------------------+

and df2:

+-------+---------------------------------------------+
|Invoice|LineItems                                    |
+-------+---------------------------------------------+
|INV0001|[{001, Item 1}, {002, Item 2}, {003, Item 3}]|
|INV0002|[{004, Item 4}, {005, Item 5}]               |
+-------+---------------------------------------------+

you could use this statement:

df.union(
  df2
    .withColumn("LineItems",
      expr("transform(LineItems, x -> named_struct('ItemID', x.ItemID, 'Name', x.Name, 'DiscountRate', null))")
    )
)

to get this result:

+-------+---------------------------------------------------------------+
|Invoice|LineItems                                                      |
+-------+---------------------------------------------------------------+
|INV0001|[{001, Item 1, 0.5}, {002, Item 2, 0.1}, {003, Item 3, 0.3}]   |
|INV0002|[{004, Item 4, 0.6}, {005, Item 5, 0.2}]                       |
|INV0001|[{001, Item 1, null}, {002, Item 2, null}, {003, Item 3, null}]|
|INV0002|[{004, Item 4, null}, {005, Item 5, null}]                     |
+-------+---------------------------------------------------------------+

Hope it helps, good luck!

vilalabinot
  • 1,420
  • 4
  • 17