I have a large table (1Tb of data) that needs to be joined with a smaller table (100k records)
SELECT st.id
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.id
In the above scenario, I am not able to control which table has to be cached into memory.I have tried using MAPJOIN , STREAMTABLE hints and also tried with parameters like conditional task size , small tbale size etc. Since the small table is on the left most side of the join, its not being cached into memory
Is there way to control the table that needs to be cached
Note: I cannot alter the table positions or the code:
Neither here...
...nor here...
Parameters used:
set hive.execution.engine=tez;
set hive.tez.container.size=4096;
set hive.merge.mapredfiles=true;
set tez.shuffle-vertex-manager.min-src-fraction=0.25;
set tez.shuffle-vertex-manager.max-src-fraction=0.75;
set hive.exec.dynamic.partition.mode=nonstrict;
set tez.am.resource.memory.mb=3200 ;
set tez.am.java.opts=-server -Xmx3200m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC -XX:+UseConcMarkSweepGC ;
SET hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=288435456;