I am currently trying to understand the concept of Salt
-ing to counter Skew
. Unfortunately, I couldn't find enough information which would help me wrap my head around the concept of Salting in the context of aggregations (E.g. Group By
& Window
etc) in Spark SQL.
So far I have construed that salted aggregations require 2 passes. Hence, I put together the following code snippet that would represent the first pass. However, I am unable to proceed from there on. Can someone please help me proceed with few examples using Spark SQL queries ?
Pass I:
create temporary view salt1
as
select cust, item, cast(rand() * 10 as int) as salt
from tab1;
create temporary view salt2
as
select cust, item
from (select cust,
item,
row_number() over (partition by salt, cust order by purch) as row_num
from salt1
)
where row_num = 1;
Thanks for any help.