I have two dataframes, as shown below.
Df1
+----------------------+---------+
|products |visitorId|
+----------------------+---------+
|[[i1,0.68], [i2,0.42]]|v1 |
|[[i1,0.78], [i3,0.11]]|v2 |
+----------------------+---------+
Df2
+---+----------+
| id| name|
+---+----------+
| i1|Nike Shoes|
| i2| Umbrella|
| i3| Jeans|
+---+----------+
Here is the schema of the dataframe Df1
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
I want to join the two dataframes so that the output is:
+------------------------------------------+---------+
|products |visitorId|
+------------------------------------------+---------+
|[[i1,0.68,Nike Shoes], [i2,0.42,Umbrella]]|v1 |
|[[i1,0.78,Nike Shoes], [i3,0.11,Jeans]] |v2 |
+------------------------------------------+---------+
Here is the schema of the output that I am expecting
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
| | |-- name: string (nullable = true)
|-- visitorId: string (nullable = true)
How do I do it in Scala? I am using Spark 2.2.0.
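For reproducibility, here is roughly how the two sample dataframes above can be built in a spark-shell session (the Product and Visitor case classes are just names I picked for this example):

import spark.implicits._

case class Product(id: String, interest: Double)
case class Visitor(products: Seq[Product], visitorId: String)

// Df1: an array of (id, interest) structs per visitor
val df1 = Seq(
  Visitor(Seq(Product("i1", 0.68), Product("i2", 0.42)), "v1"),
  Visitor(Seq(Product("i1", 0.78), Product("i3", 0.11)), "v2")
).toDF()

// Df2: the id -> name lookup
val df2 = Seq(("i1", "Nike Shoes"), ("i2", "Umbrella"), ("i3", "Jeans")).toDF("id", "name")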
Update
I exploded the products array in Df1 and joined the result with Df2 on id, which gives the output below (a sketch of the code follows the table).
+---------+---+--------+----------+
|visitorId| id|interest| name|
+---------+---+--------+----------+
| v1| i1| 0.68|Nike Shoes|
| v1| i2| 0.42| Umbrella|
| v2| i1| 0.78|Nike Shoes|
| v2| i3| 0.11| Jeans|
+---------+---+--------+----------+
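For reference, the explode-and-join was roughly along these lines (exploded and joined are just intermediate names I used):

import org.apache.spark.sql.functions._

// Flatten each (id, interest) struct into its own row per visitor
val exploded = df1
  .select(col("visitorId"), explode(col("products")).as("product"))
  .select(col("visitorId"), col("product.id"), col("product.interest"))

// Bring in the product name from Df2
val joined = exploded.join(df2, Seq("id"))
  .select("visitorId", "id", "interest", "name")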
Now I just need to get the above dataframe into the JSON format below.
{
    "visitorId": "v1",
    "products": [{
        "id": "i1",
        "name": "Nike Shoes",
        "interest": 0.68
    }, {
        "id": "i2",
        "name": "Umbrella",
        "interest": 0.42
    }]
},
{
    "visitorId": "v2",
    "products": [{
        "id": "i1",
        "name": "Nike Shoes",
        "interest": 0.78
    }, {
        "id": "i3",
        "name": "Jeans",
        "interest": 0.11
    }]
}
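What I am considering, continuing from the joined dataframe above, is something along these lines with groupBy, collect_list and struct, then writing the result out as JSON, but I am not sure this is the right approach (the output path is only a placeholder):

import org.apache.spark.sql.functions._

// Rebuild the array of structs per visitor, now including the name field
val nested = joined
  .groupBy("visitorId")
  .agg(collect_list(struct(col("id"), col("interest"), col("name"))).as("products"))

// Inspect as JSON strings, or write line-delimited JSON to disk
nested.toJSON.show(false)
nested.write.json("/tmp/output")  // placeholder path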