I start with the following table:
|date | first_cat | second_cat | price_change|
|:--------- |:--------- |:-------- | ----------:|
|30/05/2022 | old | test_2 | 0.94|
|31/08/2022 | old | test_3 | 1.24|
|30/05/2022 | old | test_2 | 0.90|
|31/08/2022 | old | test_3 | 1.44|
|30/05/2022 | new | test_1 | 1.94|
|30/06/2022 | new | test_4 | 0.54|
|31/07/2022 | new | test_5 | 1.94|
|30/06/2022 | new | test_4 | 0.96|
I want to calculate the product of `price_change` grouped by `date`, `first_cat`, and `second_cat`:
|date | first_cat | second_cat | price_aggr |
|:--------- |:--------- |:-------- | ----------:|
|30/05/2022 | old | test_2 | 0.94*0.9|
|31/08/2022 | old | test_3 | 1.24*1.44|
|30/05/2022 | new | test_1 | 1.94|
|30/06/2022 | new | test_4 | 0.54*0.96|
|31/07/2022 | new | test_5 | 1.94|
I attempted it with (note the comma after `second_cat`, which was missing in my first try):

```sql
SELECT
  date,
  first_cat,
  second_cat,
  array_join(collect_list(price_change), '*') AS price_aggr
FROM my_table
GROUP BY
  date,
  first_cat,
  second_cat
```
However, that produces a text expression in the table rather than its evaluated value. The desired result is as follows:
|date | first_cat | second_cat | price_aggr |
|:--------- |:--------- |:-------- | ----------:|
|30/05/2022 | old | test_2 | 0.846|
|31/08/2022 | old | test_3 | 1.7856|
|30/05/2022 | new | test_1 | 1.94|
|30/06/2022 | new | test_4 | 0.5184|
|31/07/2022 | new | test_5 | 1.94|
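To double-check the numbers above, here is a minimal plain-Python sketch (not Spark) of the same grouped-product arithmetic, using the sample rows from the first table:

```python
from collections import defaultdict

# Sample rows from the input table: (date, first_cat, second_cat, price_change)
rows = [
    ("30/05/2022", "old", "test_2", 0.94),
    ("31/08/2022", "old", "test_3", 1.24),
    ("30/05/2022", "old", "test_2", 0.90),
    ("31/08/2022", "old", "test_3", 1.44),
    ("30/05/2022", "new", "test_1", 1.94),
    ("30/06/2022", "new", "test_4", 0.54),
    ("31/07/2022", "new", "test_5", 1.94),
    ("30/06/2022", "new", "test_4", 0.96),
]

# Multiply price_change within each (date, first_cat, second_cat) group
products = defaultdict(lambda: 1.0)
for date, first_cat, second_cat, price_change in rows:
    products[(date, first_cat, second_cat)] *= price_change

for key, value in sorted(products.items()):
    print(key, round(value, 4))
```

This reproduces the values in the desired-result table, e.g. 0.94 * 0.90 = 0.846 and 1.24 * 1.44 = 1.7856.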
I saw some ideas, but they use Pandas and other methods that fall outside Spark SQL:

Cumulative product in Spark

I need to do it in Spark SQL alone; I'd like to avoid converting to Pandas and writing UDFs.
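For context, the ideas I've seen lean on the identity prod(x) = exp(sum(ln(x))), which only holds for positive factors; here is a quick plain-Python sanity check of that identity on the first group's values (0.94 and 0.90):

```python
import math

# For positive factors, a product can be expressed with only
# SUM, LN and EXP -- the trick used in cumulative-product answers.
factors = [0.94, 0.90]  # the (30/05/2022, old, test_2) group
direct = factors[0] * factors[1]
via_logs = math.exp(sum(math.log(x) for x in factors))
print(round(direct, 4), round(via_logs, 4))
```

Since all my `price_change` values happen to be positive, this identity would apply, but I'd still prefer a direct Spark SQL aggregation if one exists.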
Many thanks!