
We have a simple query running on a table/view that is approximately 5 TB in size. We perform ETL and finally add the data to the core table by adding a partition.

Because the data we are processing is huge, the query spawns 4000+ mappers and 1000+ reducers and runs for more than 40 minutes.

How can I improve the query and reduce resource utilization?

Query:

insert overwrite table dss.prblm_mtrc partition (LOAD_DT) select *, '2019-01-02' as LOAD_DT from dss.v_prblm_mtrc_stg_etl

1 Answer


Use a static partition. If there are already many partitions in the target table, Hive will scan them faster before the final load. See also: HIVE Dynamic Partitioning tips

insert overwrite table dss.prblm_mtrc partition (LOAD_DT='2019-01-02') select * from dss.v_prblm_mtrc_stg_etl

If your source table has too many small files, try to reduce them: use CONCATENATE for ORC files, or enable merging of small output files when loading the source table (a sketch follows).
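For illustration only, a minimal sketch; the table name source_stg_table and the partition value are hypothetical, and the merge thresholds are example values, not recommendations:

-- merge small ORC files of an already-loaded partition in place
alter table source_stg_table partition (part_col='2019-01-01') concatenate;

-- or merge small output files while (re)loading the source table
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.tezfiles=true;
set hive.merge.smallfiles.avgsize=128000000;  -- example: merge outputs averaging below ~128 MB
set hive.merge.size.per.task=256000000;       -- example: target ~256 MB per merged file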

Enable vectorized execution and tune mapper and reducer parallelism: https://stackoverflow.com/a/48487306/2700344

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
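For parallelism, a rough sketch of the kind of settings discussed in the linked answer; the byte values below are illustrative and depend on your data and cluster:

-- fewer, larger mapper splits (MapReduce engine)
set mapreduce.input.fileinputformat.split.maxsize=536870912;
set mapreduce.input.fileinputformat.split.minsize=268435456;
-- on Tez, split grouping is controlled by tez.grouping.min-size / tez.grouping.max-size instead
-- fewer reducers: let each reducer process more bytes
set hive.exec.reducers.bytes.per.reducer=1073741824;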

All vectorization settings are listed here.

If your target table already contains many loaded partitions, try switching off automatic statistics gathering; this setting may speed up loading:

set hive.stats.autogather=false;

Also, if your source table is ORC and split calculation takes too long, try this setting:

set hive.exec.orc.split.strategy=BI;

More about ORC strategy: ORC split strategy
