
I am encountering the following error when running the Spark SQL query below.

org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 33 tasks (1046.8 MB) is larger than spark.driver.maxResultSize (1024.0 MB)

The source table record counts are:

contrib = 734539064
contrib (with the 2018 filters applied) = 683878665
date_dim = 73416

The query is:

select policy,
       order,
       covr_fr,
       covr_to,
       sum(fc_am) - sum(mc_am) as amount
from (select coalesce(cntr.vre2, cntr.cacont_acc) as policy,
             cntr.order,
             date_dim_get_sk.date_dim_sk as pst_dt_sk,
             cntr.covr_fr,
             cntr.covr_to,
             cntr.fc_am,
             cntr.mc_am
      from (select * from contrib
            where ind_cd = 'BP'
            and flg IN ('001', '004')
            and covr_fr > '2018-01-01') cntr
      JOIN date_dim ON date_dim.dt = cntr.pstng_dt
      JOIN date_dim_get_sk ON date_dim_get_sk.dt = date_dim.dt_lst_dayofmon) agg
GROUP BY policy,
         order,
         covr_fr,
         covr_to
HAVING sum(fc_am) - sum(mc_am) > 0

Currently this query fails with the error above. I have tried caching the contrib table, but to no avail. Can anyone please help me fix this error and tune the query so that it runs? Please let me know if additional information is required.

Thanks

marie20

1 Answer


Have you looked at this topic?

It's hard to suggest much based on the SQL query alone.
As the simplest workable solution, try increasing spark.driver.maxResultSize:

config("spark.driver.maxResultSize", "0")

A value of 0 means unlimited.
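For context, here is a minimal sketch of where that config call goes, assuming a Scala Spark job built with SparkSession.builder(); the app name is illustrative.

// Minimal sketch, assuming a Scala Spark application; the app name is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("contrib-aggregation")             // illustrative name only
  .config("spark.driver.maxResultSize", "0")  // 0 = no limit; a bounded value such as "4g" is also possible
  .getOrCreate()

The same property can also be passed at submit time with spark-submit --conf spark.driver.maxResultSize=0 if you prefer not to hard-code it in the application.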

Gelerion
  • I am still encountering the same issue even with the setting you recommended. Also, when I use the ```limit``` clause (e.g. ```limit 100```) in the SQL, I get the results back successfully, which indicates the size of the result set is the culprit here. Also, if you happen to know any source with info on how to optimize the explain plan, please let me know. – marie20 Aug 06 '19 at 16:29