I am encountering the following error when running the Spark SQL query below.
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 33 tasks (1046.8 MB) is larger than spark.driver.maxResultSize (1024.0 MB)
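For context, the job is launched through the spark-sql CLI, roughly like this (the file name and resource settings below are illustrative placeholders, not my exact values; spark.driver.maxResultSize is left at its default of 1024m, which is the limit mentioned in the error):

spark-sql \
  --master yarn \
  --driver-memory 8g \
  --executor-memory 8g \
  --num-executors 20 \
  -f my_query.sql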
The source table record counts are:
contrib = 734539064
contrib (with filters 2018) = 683878665
date_dim = 73416
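The counts above were obtained with plain count queries, roughly as follows (the "filters 2018" count uses the same ind_cd / flg / covr_fr predicates as the main query below):

select count(*) from contrib;

select count(*)
from contrib
where ind_cd = 'BP'
  and flg IN ('001', '004')
  and covr_fr > '2018-01-01';

select count(*) from date_dim;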
The query is:
select policy,
       order,
       covr_fr,
       covr_to,
       sum(fc_am) - sum(mc_am) as amount
from (select coalesce(cntr.vre2, cntr.cacont_acc) as policy,
             cntr.order,
             date_dim_get_sk.date_dim_sk as pst_dt_sk,
             cntr.covr_fr,
             cntr.covr_to,
             cntr.fc_am,   -- measures used by the outer aggregation
             cntr.mc_am
      from (select *
            from contrib
            where ind_cd = 'BP'
              and flg IN ('001', '004')
              and covr_fr > '2018-01-01') cntr
      JOIN date_dim ON date_dim.dt = cntr.pstng_dt
      JOIN date_dim_get_sk ON date_dim_get_sk.dt = date_dim.dt_lst_dayofmon) t
GROUP BY policy,
         order,
         covr_fr,
         covr_to
HAVING sum(fc_am) - sum(mc_am) > 0
The query currently fails with the error above. I have tried caching the contrib table, but that did not help.
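The caching attempt was essentially the Spark SQL CACHE TABLE command (quoted from memory, so treat it as approximate):

-- cache the large fact table before running the aggregation
CACHE TABLE contrib;

-- re-running the query above after this still fails with the same maxResultSize error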
Can anyone please help me fix this error and tune the query so that it completes successfully? Please let me know if any additional information is required.
Thanks