Here is my cluster configuration:
Master nodes: 1 (16 vCPU, 64 GB memory)
Worker nodes: 2 (total of 64 vCPU, 256 GB memory)
Here is the Hive query I'm trying to run on the Spark SQL shell:
select a.*,b.name as name from (
small_tbl b
join
(select *
from large_tbl where date = '2019-01-01') a
on a.id = b.id);
Here is the query execution plan as shown on the Spark UI:
The configuration properties set while launching the shell are as follows:
spark-sql --conf spark.driver.maxResultSize=30g \
--conf spark.broadcast.compress=true \
--conf spark.rdd.compress=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=304857600 \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executor.instances=12 \
--conf spark.executor.memory=16g
--conf spark.executor.cores=5 \
--conf spark.driver.memory=32g \
--conf spark.yarn.executor.memoryOverhead=512 \
--conf spark.executor.extrajavaoptions=-Xms20g \
--conf spark.executor.heartbeatInterval=30s \
--conf spark.shuffle.io.preferDirectBufs=true \
--conf spark.memory.fraction=0.5
I have tried most of the solutions suggested here and here which is evident in the properties set above. As far as I know it's not a good idea to increase the maxResultSize property on the driver side since datasets may grow beyond driver's memory size and driver shouldn't be used to store data in this scale.
I have executed the query on Tez engine successfully which took around 4 minutes, whereas Spark takes more than 15 mins to execute and terminates abruptly with the lack of heap space issue.
I strongly believe there must be a way to speed up the query execution on Spark. Please suggest me a solution that works for this kind of queries.