I am a newbie trying to take a large (1.25 TB uncompressed) HDFS file and load it into a Hive managed table. It is already on HDFS in CSV format (from Sqoop) with an arbitrary partition, and I am putting it into a more organized format for querying and joining. I'm on HDP 3.0 using Tez. Here is my HQL:
USE MYDB;
DROP TABLE IF EXISTS new_table;
CREATE TABLE IF NOT EXISTS new_table (
    svcpt_id VARCHAR(20),
    usage_value FLOAT,
    read_time SMALLINT)
PARTITIONED BY (read_date INT)
CLUSTERED BY (svcpt_id) INTO 9600 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC
TBLPROPERTIES("orc.compress"="snappy");
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.vectorized.execution.enabled = true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.enforce.bucketing = true;
SET mapred.reduce.tasks = 10000;
INSERT OVERWRITE TABLE new_table
PARTITION (read_date)
SELECT svcpt_id, usage, read_time, read_date
FROM raw_table;
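In case partition skew is relevant, this is roughly how the spread of the dynamic partition key in the source can be checked (a sketch, using the table and column names from the query above):

```sql
-- Sketch: row count per target partition in the source table,
-- to see whether any single read_date is disproportionately large.
SELECT read_date, COUNT(*) AS cnt
FROM raw_table
GROUP BY read_date
ORDER BY cnt DESC
LIMIT 20;
```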
The way Tez sets this up is (from my most recent failure):
--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
Map 1 SUCCEEDED 1043 1043 0 0 0 0
Reducer 2 RUNNING 9600 735 19 8846 0 0
Reducer 3 INITED 10000 0 0 10000 0 0
--------------------------------------------------------------------------------
VERTICES: 01/03 [==>>------------------------] 8% ELAPSED TIME: 45152.59 s
--------------------------------------------------------------------------------
I've been working on this for a while. At first I could not get the Map 1 vertex to run, so I added buckets. 96 buckets got the first mapper running, but Reducer 2 failed, citing disk-space issues that did not make sense. I then upped the number of buckets to 9600 and the reduce tasks to 10000, and the Reducer 2 vertex started running, albeit slowly. This morning I found it had errored out because my namenode had shut down due to a Java heap space error from the garbage collector.
Does anyone have any guiding advice for me? I feel like I'm shooting in the dark with the number of reduce tasks, number of buckets, and all the configs shown below.
hive.tez.container.size = 5120MB
hive.exec.reducers.bytes.per.reducer = 1GB
hive.exec.max.dynamic.partitions = 5000
hive.optimize.sort.dynamic.partition = FALSE
hive.vectorized.execution.enabled = TRUE
hive.vectorized.execution.reduce.enabled = TRUE
yarn.scheduler.minimum-allocation-mb = 2G
yarn.scheduler.maximum-allocation-mb = 8G
mapred.min.split.size=?
mapred.max.split.size=?
hive.input.format=?
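For scale: at the configured 1 GB per reducer, 1.25 TB of input works out to roughly 1280 reducers, so my hand-set 10000 may be far more than Hive would ever choose on its own. My understanding of the alternative is something like this (standard Hive settings; I have not verified this helps, and with bucketing enforced the bucket count may still drive the reducer count):

```sql
-- Sketch: let Hive size the reduce stage from the data instead of forcing it.
-- mapred.reduce.tasks = -1 (the default) means Hive estimates the count from
-- hive.exec.reducers.bytes.per.reducer (1 GB here => ~1280 reducers for 1.25 TB).
SET mapred.reduce.tasks = -1;
SET hive.exec.reducers.bytes.per.reducer = 1073741824;
```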
I have not set up LLAP.
My cluster has 4 nodes, 32 cores, and 120 GB of memory. I have not used more than 1/3 of the cluster's storage.