union two dataframes with nested different schemas

Question

Dataframe1 looks like this

root
 |-- source: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- content: struct (containsNull = true)
 |    |    |-- ptype: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- verb: string (nullable = true)
 |    |    |-- foobar: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- fooId: integer (nullable = true)
 |-- date: string (nullable = false)
 |-- hour: string (nullable = false)

Dataframe 2 look like below:

root
 |-- source: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- content: struct (containsNull = true)
 |    |    |-- ptype: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- verb: string (nullable = true)
 |    |    |-- foobar: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- date: string (nullable = false)
 |-- hour: string (nullable = false)

Notice the differnce - there is no fooId in the second dataframe. How can I union these two dataframes together? I understand that the two schemas need to be the same to union. What is the best way to add fooId or remove fooId?(non trivial because of the structure of the schema) What is the recommended approach for doing union of this kind. Thanks

To use `union` the schema of the two dataframes need to match. So either remove the `fooId` column in the first dataframe or add it (as null or any constant value) to the second dataframe. — Shaido, Nov 27 '18 at 02:15
@Shaido I have edited the question. I understand it needs to be the same. — suprita shankar, Nov 27 '18 at 02:43
@Shaido I want to understand how do folks handle this case without changing the inherent structure much. — suprita shankar, Nov 27 '18 at 02:44
To add a nested column to dataframe 2, see here https://stackoverflow.com/questions/44831789/spark-sql-nested-withcolumn. Add fooId with all null values, then you can Union the two — sramalingam24, Nov 27 '18 at 06:18
The above link is to add a nested column to a struct. How will you do this for arrays? it is not accessible with the `.` operator — suprita shankar, Nov 28 '18 at 19:20

score 0 · Answer 1 · answered Nov 27 '18 at 09:09

0

As you considered two Dataframes let DF1 and DF2, You could remove the extra column in the DF1 and run a untion of both the dataframes

// this is to remove the extra column in the dataframe
DF1.drop("fooId")

Now both the DFs has the same number of columns so you can do a union

DF1.union(DF2)

answered Nov 27 '18 at 09:09

Sundeep Pidugu

2,377
2
21
43

1

This will not work - as `drop` will not drop nested columns. – suprita shankar Nov 28 '18 at 19:18

score 0 · Answer 2 · edited Aug 07 '23 at 08:48

0

try to extract all nested key-value pairs of df1 and then use drop:

df = df1.select(col("results.content.*"))
#this will give you all column names
#Then drop a specific column
df = df.drop("fooId")

edited Aug 07 '23 at 08:48

Koedlt

4,286
8
15
33

answered Aug 01 '23 at 09:21

mansigaur

1

score 0 · Answer 3 · answered Aug 01 '23 at 09:26

0

This should work with using unionByName and using allowMissingColumns to True.

answered Aug 01 '23 at 09:26

partlov

13,789
6
63
82

union two dataframes with nested different schemas

3 Answers3