
I have a single but huge table in Hive which will almost always be queried by the primary key column (say, employee_id). The table is really huge, with millions of rows inserted each day, and I want to query it fast using partitions on this field. I followed this post and I know that partitioning is only good for low-cardinality fields, so how can I accomplish my goal of fast queries on the employee_id column?

I understand that an id column with very high cardinality should be used for bucketing instead, but bucketing does not help with query performance on a single table, does it?

I think that if I could use something like hash(employee_id) for partitioning, it would help me very much. Is this possible? I couldn't find anything like this in the Hive documentation.
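For concreteness, the kind of thing I am imagining looks roughly like the sketch below (the hash_bucket column, the name column and the bucket count are made up purely for illustration; I could not find any Hive syntax that derives a partition value from an expression like this automatically):

    -- what I would like: a partition derived from a hash of the primary key
    CREATE TABLE employee (
      employee_id BIGINT,
      name STRING
    )
    PARTITIONED BY (hash_bucket INT);  -- ideally filled as hash(employee_id) % 256

    -- so that a lookup on employee_id prunes to a single partition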

To summarize, what I want is a fast query result for:

select * from employee where employee_id=XXX

assuming the employee table has billions of records with primary key column employee_id, where classical partitioning by year, month, day, etc. does not help.

Thanks in advance,


2 Answers

  1. Use ORC with bloom filters:
    CREATE TABLE employee (
      employee_id bigint,
      name STRING
    ) STORED AS ORC 
    TBLPROPERTIES ("orc.bloom.filter.columns"="employee_id")
    ;
  2. Enable PPD with vectorization, use CBO and Tez:
    SET hive.optimize.ppd=true;
    SET hive.optimize.ppd.storage=true;
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;
    SET hive.cbo.enable=true;
    SET hive.stats.autogather=true;
    SET hive.compute.query.using.stats=true;
    SET hive.stats.fetch.partition.stats=true;
    SET hive.execution.engine=tez;
    SET hive.stats.fetch.column.stats=true;
    SET hive.map.aggr=true;
    SET hive.tez.auto.reducer.parallelism=true;

Ref: https://community.cloudera.com/t5/Community-Articles/Optimizing-Hive-queries-for-ORC-formatted-tables/ta-p/248164
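As a usage sketch for items 1 and 2 above (the employee_staging table is hypothetical, and sorting on write is an extra step not required by them), keeping each output file clustered on employee_id lets the ORC min/max statistics and the bloom filter skip most stripes and row groups on a point lookup:

    -- one-off load; SORT BY orders rows within each reducer, so every
    -- output file is clustered on employee_id
    INSERT OVERWRITE TABLE employee
    SELECT employee_id, name
    FROM employee_staging
    SORT BY employee_id;

    -- with PPD enabled, only the stripes/row groups whose statistics and
    -- bloom filter may contain the value are actually read
    SELECT * FROM employee WHERE employee_id = 12345;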

  3. Tune mapper and reducer parallelism:

    -- example settings for mappers:
    SET tez.grouping.max-size=67108864;
    SET tez.grouping.min-size=32000000;

    -- example settings for reducers:
    SET hive.exec.reducers.bytes.per.reducer=67108864; -- decrease this to increase the number of reducers

Change these figures to achieve optimal performance.
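For example, with hive.exec.reducers.bytes.per.reducer=67108864 (64 MB), a reduce stage over roughly 64 GB of data is planned with about 1,000 reducers; halving the setting roughly doubles that number, up to the hive.exec.reducers.max ceiling.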

– user1
  • Thank you for your answer. I will try this and see the performance. Thanks again. – Mehmet Ali Karabulut Jan 17 '18 at 08:34
  • another table property to add would be ```'orc.compress'='SNAPPY'``` – Gabe Jan 17 '18 at 14:30
  • Unfortunately, no. We stopped using Hive and instead use an HBase table, which can be accessed very fast via its primary index; we also added a Phoenix layer on top of it for SQL comfort by creating views over the existing tables. – Mehmet Ali Karabulut Jun 13 '18 at 10:21
  • 1
    For a description of each command at step 2 - https://community.cloudera.com/t5/Community-Articles/Optimizing-Hive-queries-for-ORC-formatted-tables/ta-p/248164 – user1 Jan 11 '21 at 15:20

You can try bucketing the Hive table. Buckets are defined on the high-cardinality field, which is exactly the hashing idea you describe: Hive distributes rows across a fixed number of buckets by hashing employee_id.
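A minimal sketch of such a bucketed table, reusing the schema from the other answer (the bucket count of 256 is only an illustration):

    -- Hive assigns each row to one of 256 buckets by hashing employee_id
    CREATE TABLE employee (
      employee_id BIGINT,
      name STRING
    )
    CLUSTERED BY (employee_id) INTO 256 BUCKETS
    STORED AS ORC;

    -- on older Hive versions, make inserts honor the bucketing:
    SET hive.enforce.bucketing=true;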

– juhengzhe