Java heap space OutOfMemoryError while running join query in Spark SQL shell

Question

Here is my cluster configuration:

Master nodes: 1 (16 vCPU, 64 GB memory)

Worker nodes: 2 (total of 64 vCPU, 256 GB memory)

Here is the Hive query I'm trying to run on the Spark SQL shell:

select a.*, b.name as name
from (
  small_tbl b
  join (
    select *
    from large_tbl
    where date = '2019-01-01'
  ) a
  on a.id = b.id
);

Here is the query execution plan as shown on the Spark UI:

The configuration properties set while launching the shell are as follows:

spark-sql --conf spark.driver.maxResultSize=30g \
--conf spark.broadcast.compress=true \
--conf spark.rdd.compress=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=304857600 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=12 \
--conf spark.executor.memory=16g \
--conf spark.executor.cores=5 \
--conf spark.driver.memory=32g \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.executor.extrajavaoptions=-Xms20g \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.shuffle.io.preferDirectBufs=true \
--conf spark.memory.fraction=0.5
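
As a rough sanity check on the flags above, the sketch below adds up what they request. It assumes (this is not stated in the question) that each YARN container is sized as roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead, and that the 256 GB worker total can be read as GiB. The variable names are illustrative only.

```python
# Back-of-the-envelope totals for the spark-sql flags above.
# Assumption: container size ~= executor heap + memoryOverhead; whether the
# off-heap size is also added on top depends on the Spark version.
MIB = 1024 ** 2
GIB = 1024 ** 3

executor_heap = 16 * GIB       # spark.executor.memory=16g
overhead = 512 * MIB           # spark.yarn.executor.memoryOverhead=512
off_heap = 304857600           # spark.memory.offHeap.size (bytes)
initial_heap = 20 * GIB        # -Xms20g from the extra Java options
executors = 12                 # spark.executor.instances
cores_per_executor = 5         # spark.executor.cores

worker_memory = 256 * GIB      # both worker nodes combined (assumed GiB)
worker_vcpus = 64              # both worker nodes combined

container = executor_heap + overhead
total_memory = executors * container
total_cores = executors * cores_per_executor

print(f"off-heap size:          {off_heap / MIB:.1f} MiB")
print(f"per-executor container: {container / GIB:.2f} GiB")
print(f"all executors:          {total_memory / GIB:.1f} GiB of {worker_memory / GIB:.0f} GiB")
print(f"cores requested:        {total_cores} of {worker_vcpus}")
print(f"-Xms larger than heap:  {initial_heap > executor_heap}")
```

Under those assumptions the 12 executors ask for 198 GiB and 60 of the 64 worker vCPUs, the off-heap pool is only about 291 MiB, and the -Xms20g initial heap is larger than the 16g executor heap.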

I have tried most of the solutions suggested here and here, as is evident from the properties set above. As far as I know, it is not a good idea to increase the maxResultSize property on the driver side, since datasets may grow beyond the driver's memory size and the driver shouldn't be used to store data at this scale.

I executed the same query successfully on the Tez engine, where it took around 4 minutes, whereas Spark runs for more than 15 minutes and then terminates abruptly with the Java heap space error.

I strongly believe there must be a way to speed up the query execution on Spark. Please suggest a solution that works for this kind of query.

try following [this](https://stackoverflow.com/questions/58562013/why-pyspark-jobs-are-dying-out-in-the-middle-of-process-without-any-particular-e/58574186#58574186) answer — Gsquare, Nov 03 '19 at 17:46
I'm no Java expert but I'm puzzled by the combination of `executor.extrajavaoptions=-Xms20g` and `executor.memory=16g` (plus/minus the YARN overhead, off-heap and of course stack+code size...) — Samson Scharfrichter, Nov 03 '19 at 20:40
Did you check for nulls in both fields? Add `--conf "spark.sql.crossJoin.enabled=false"` to your spark-sql command and give it a try — Sarath Chandra Vema, Nov 04 '19 at 09:11
@Gsquare Let's validate the solutions against my query. Code issues: 1. I can't try checkpoint() as I'm on the SQL shell. 2. Broadcasting shouldn't be an issue: I tried disabling autoBroadcastJoinThreshold, and it carried out a sort-merge join, ending up with the same issue. 3. The data size in each task is in the ideal range. Data issues: 1. There isn't much difference in time between the 25th percentile, median, and 75th percentile of stage execution. 2. Input file format: ORC; the large table is partitioned on the date column. I tried setting spark.default.parallelism 5 but no luck. Please let me know if there's something wrong here — Arjun A J, Nov 04 '19 at 17:43
@SarathChandraVema, I tried running the query with the property you suggested, but there was no improvement. I got a GC-related error: GC overhead limit exceeded. By the way, did you mean I need to check for null values in the joining keys on both tables? Please clarify. — Arjun A J, Nov 04 '19 at 18:00
To add a few details about the job: the input size shown on the Spark UI is 6.8 GB. 56 tasks were created. All tasks execute successfully, but the job gets stuck after that. I guess it's unable to bring the results to the driver and thereby to the user on the CLI. I have increased the memory parameters of the driver as well as the executors, but I'm clueless about where exactly it needs memory. — Arjun A J, Nov 04 '19 at 18:04
After reading your comment, it seems like the nulls issue. Yes, I need the count of nulls in joining keys in both the tables — Sarath Chandra Vema, Nov 05 '19 at 04:36
Hi @SarathChandraVema, I checked joining keys. There are no null values in either of them. — Arjun A J, Nov 05 '19 at 17:50
Can you try the query with sample data, say 1000 rows from each table? — Sarath Chandra Vema, Nov 06 '19 at 06:36
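
The figures quoted in the question and comments can be put side by side with a little arithmetic. This is only a sketch: it treats the 6.8 GB input as evenly spread over the 56 tasks, which the Spark UI may not bear out, and the variable names are illustrative.

```python
# Figures taken from the question and comments above.
input_gb = 6.8                 # input size reported on the Spark UI
tasks = 56                     # tasks created for the stage
driver_memory_gb = 32          # spark.driver.memory=32g
max_result_size_gb = 30        # spark.driver.maxResultSize=30g

# Assumption: input is spread evenly across tasks.
per_task_mb = input_gb * 1024 / tasks

# Share of the driver heap that collected results are allowed to occupy.
result_share = max_result_size_gb / driver_memory_gb

print(f"average input per task: {per_task_mb:.0f} MB")
print(f"maxResultSize as a share of driver heap: {result_share:.0%}")
```

The per-task input is modest, which fits the observation that all tasks finish. The driver, on the other hand, is allowed to accept serialized results close to the size of its whole heap, so the step that returns rows to the CLI is a plausible place for the heap to run out.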

0 Answers