1

Dataframe1 looks like this

root
 |-- source: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- content: struct (containsNull = true)
 |    |    |-- ptype: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- verb: string (nullable = true)
 |    |    |-- foobar: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- fooId: integer (nullable = true)
 |-- date: string (nullable = false)
 |-- hour: string (nullable = false)

Dataframe 2 look like below:

root
 |-- source: string (nullable = true)
 |-- results: array (nullable = true)
 |    |-- content: struct (containsNull = true)
 |    |    |-- ptype: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- verb: string (nullable = true)
 |    |    |-- foobar: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |-- date: string (nullable = false)
 |-- hour: string (nullable = false)

Notice the differnce - there is no fooId in the second dataframe. How can I union these two dataframes together? I understand that the two schemas need to be the same to union. What is the best way to add fooId or remove fooId?(non trivial because of the structure of the schema) What is the recommended approach for doing union of this kind. Thanks

suprita shankar
  • 1,554
  • 2
  • 16
  • 47
  • 1
    To use `union` the schema of the two dataframes need to match. So either remove the `fooId` column in the first dataframe or add it (as null or any constant value) to the second dataframe. – Shaido Nov 27 '18 at 02:15
  • @Shaido I have edited the question. I understand it needs to be the same. – suprita shankar Nov 27 '18 at 02:43
  • @Shaido I want to understand how do folks handle this case without changing the inherent structure much. – suprita shankar Nov 27 '18 at 02:44
  • To add a nested column to dataframe 2, see here https://stackoverflow.com/questions/44831789/spark-sql-nested-withcolumn. Add fooId with all null values, then you can Union the two – sramalingam24 Nov 27 '18 at 06:18
  • The above link is to add a nested column to a struct. How will you do this for arrays? it is not accessible with the `.` operator – suprita shankar Nov 28 '18 at 19:20

3 Answers3

0

As you considered two Dataframes let DF1 and DF2, You could remove the extra column in the DF1 and run a untion of both the dataframes

// this is to remove the extra column in the dataframe
DF1.drop("fooId")

Now both the DFs has the same number of columns so you can do a union

DF1.union(DF2)
Sundeep Pidugu
  • 2,377
  • 2
  • 21
  • 43
0

try to extract all nested key-value pairs of df1 and then use drop:

df = df1.select(col("results.content.*"))
#this will give you all column names
#Then drop a specific column
df = df.drop("fooId")
Koedlt
  • 4,286
  • 8
  • 15
  • 33
0

This should work with using unionByName and using allowMissingColumns to True.

partlov
  • 13,789
  • 6
  • 63
  • 82