I've got a question about doing an aggregation over an array of nested JSON. I have a sample order DataFrame (shown here as JSON) as below:
{
  "orderId": "oi1",
  "orderLines": [
    {
      "productId": "p1",
      "quantity": 1,
      "sequence": 1,
      "totalPrice": {
        "gross": 50,
        "net": 40,
        "tax": 10
      }
    },
    {
      "productId": "p2",
      "quantity": 3,
      "sequence": 2,
      "totalPrice": {
        "gross": 300,
        "net": 240,
        "tax": 60
      }
    }
  ]
}
How can I use Spark SQL to sum the quantities across all lines for a given order?
e.g. in this case 1 + 3 = 4
I'd like to write something like the below, but there doesn't appear to be an equivalent built-in function (unless I've missed it, which is quite possible!):
SELECT
  orderId,
  sum_array(orderLines.quantity) AS totalQuantityItems
FROM
  orders
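To make the intended semantics concrete, here's the same aggregation expressed in plain Scala collections (the case classes are just hypothetical stand-ins mirroring the JSON shape above, not actual Spark code):

```scala
// Hypothetical case classes mirroring the order JSON shown above
case class OrderLine(productId: String, quantity: Int, sequence: Int)
case class Order(orderId: String, orderLines: Seq[OrderLine])

val order = Order("oi1", Seq(
  OrderLine("p1", 1, 1),
  OrderLine("p2", 3, 2)
))

// The aggregation I'm after: sum quantity over the nested array
val totalQuantityItems = order.orderLines.map(_.quantity).sum
println(totalQuantityItems) // 4
```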
Maybe a custom UDF (Scala) is needed? What would that look like, if so? Any examples? Going even further into the nesting, how would I sum the net totals:
SELECT
  orderId,
  sum_array(orderLines.totalPrice.net) AS totalOrderNet
FROM
  orders
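In case it helps clarify what I mean by going further into the nesting, here's the deeper aggregation in plain Scala collections too (again, the case classes are hypothetical illustrations of the JSON shape, and this is the logic I'd want the SQL to express):

```scala
// Hypothetical case classes mirroring the nested totalPrice object (illustration only)
case class TotalPrice(gross: Double, net: Double, tax: Double)
case class OrderLine(productId: String, quantity: Int, sequence: Int, totalPrice: TotalPrice)
case class Order(orderId: String, orderLines: Seq[OrderLine])

val order = Order("oi1", Seq(
  OrderLine("p1", 1, 1, TotalPrice(50, 40, 10)),
  OrderLine("p2", 3, 2, TotalPrice(300, 240, 60))
))

// One level deeper: extract the nested net amount from each line, then sum
val totalOrderNet = order.orderLines.map(_.totalPrice.net).sum
println(totalOrderNet) // 280.0
```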