
I'm kind of new to Hive and Hadoop. I have a query which takes 10 minutes to complete.

The size of the data is 10 GB (Statistics: Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE).

The table is partitioned and bucketed.

How can I improve the query below?

select * from tbl1 where clmn='Abdul' and loc='IND' and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP() - 5*60*60);
set hive.vectorized.execution.reduce.enabled=true;
set hive.tez.container.size=8192;
set hive.fetch.task.conversion = none;
set mapred.compress.map.output=true;
set mapred.output.compress=true;


+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                                                           Explain                                                                                                           |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| Plan not optimized by CBO.                                                                                                                                                                                                  |
|                                                                                                                                                                                                                             |
| Stage-0                                                                                                                                                                                                                     |
|    Fetch Operator                                                                                                                                                                                                           |
|       limit:-1                                                                                                                                                                                                              |
|       Stage-1                                                                                                                                                                                                               |
|          Map 1                                                                                                                                                                                                              |
|          File Output Operator [FS_2973]                                                                                                                                                                                     |
|             compressed:false                                                                                                                                                                                                |
|             Statistics:Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE                                                                                                                     |
|             table:{"input format:":"org.apache.hadoop.mapred.TextInputFormat","output format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}  |
|             Select Operator [SEL_2972]                                                                                                                                                                                      |
|                outputColumnNames:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"]                                                                                                                          |
|                Statistics:Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE                                                                                                                  |
|                Filter Operator [FIL_2971]                                                                                                                                                                                   |
|                   predicate:((section = 'xysaa') and (to_unix_timestamp(ts) > (unix_timestamp() - 18000))) (type: boolean)                                                                                               |
|                   Statistics:Num rows: 49528 Data size: 24516360 Basic stats: COMPLETE Column stats: COMPLETE                                                                                                               |
|                   TableScan [TS_2970]                                                                                                                                                                                       |
|                      ACID table:true                                                                                                                                                                                        |
|                      alias:pp                                                                                                                                                                              |
|                      Statistics:Num rows: 4457541 Data size: 1854337449 Basic stats: COMPLETE Column stats: COMPLETE                                                                                                        |
|                                                                                                                                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+

None of these parameters helped us get the query to finish in a shorter period of time.

Varshini

1 Answer


According to the plan, the query runs only in mappers and vectorization is not enabled. Try this:

set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled=true;
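When vectorization kicks in, the Tez explain plan shows it explicitly under the vertex. For example (illustrative output; your operator names will differ):

Map 1
   Execution mode: vectorized

Note that in many Hive versions vectorized execution requires the ORC format, which your table already uses.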

Tune mapper parallelism:

set tez.grouping.max-size=67108864;
set tez.grouping.min-size=32000000;
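As a rough sanity check on these example values: 67108864 bytes is 64 MB, so with splits capped at 64 MB a 10 GB table should be read by roughly 10240 / 64 ≈ 160 mappers, instead of the handful you get with the default grouping (often 1 GB max split size).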

Play with these settings to increase the number of mappers running. Ideally, the query should run without this setting:

set hive.tez.container.size=8192;

One more recommendation is to replace unix_timestamp() with UNIX_TIMESTAMP(current_timestamp). unix_timestamp() is not deterministic and its value is not fixed for the scope of a query execution, which prevents proper optimization of the query; for this reason it has been deprecated since Hive 2.0 in favor of the CURRENT_TIMESTAMP constant.

(UNIX_TIMESTAMP(current_timestamp) - 5*60*60)
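Putting it together, the rewritten query would look like this (a sketch based on your original statement):

select *
from tbl1
where clmn = 'Abdul'
  and loc = 'IND'
  and TO_UNIX_TIMESTAMP(ts) > (UNIX_TIMESTAMP(current_timestamp) - 5*60*60);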

Also, your files are very small: each partition is 200-500 MB with 12 files, so the files are roughly 20-50 MB each. Fortunately the table is stored as ORC, so you can concatenate the files using the ALTER TABLE ... CONCATENATE command. That said, 12 files is not a big deal, and you probably will not notice an improvement when querying a single partition.
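For example, for a single partition (assuming loc is your partition column, as mentioned in the comments):

ALTER TABLE tbl1 PARTITION (loc='IND') CONCATENATE;

Run it per partition (for a non-partitioned table, omit the PARTITION clause); Hive merges the small ORC files into fewer, larger ones.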

See also this answer: https://stackoverflow.com/a/48487306/2700344

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • No improvement, mate, still the same. I was wondering why it takes 10 minutes to complete the query; even when the data is huge (in TBs) it takes the same time to return the results. How can I make this query faster? – Varshini Nov 20 '18 at 16:35
  • @Varshini Please answer these questions: how many files are in a single partition, what are the file format and compression, and what is the partition column? – leftjoin Nov 20 '18 at 17:15
  • The partition column is loc, and the compression is ZLIB with the ORC file format. There are 40-50 partitions. – Varshini Nov 21 '18 at 18:24
  • @Varshini Please also check how many files there are per partition and their size. – leftjoin Nov 21 '18 at 18:33
  • There are 12 files in almost all the partitions. Every partition is 200-500 MB, and the total size of the table is 10 GB. – Varshini Nov 21 '18 at 19:25
  • @Varshini It is not too many files, but they are too small. Concatenation may help a little bit. I also added a recommendation about CURRENT_TIMESTAMP. – leftjoin Nov 21 '18 at 19:55
  • @Varshini Is it because of bucketing that you have such small files? Maybe you have some other reason for bucketing... but IMHO it should be one 500 MB file instead. – leftjoin Nov 21 '18 at 19:59
  • Hello @leftjoin, how did you get the partition size, the files per partition, and their size from the explain plan? Can you please help me read the explain plan properly? – akash sharma Nov 23 '18 at 14:11
  • @akashsharma I asked Varshini about the sizes; see the previous comments. – leftjoin Nov 23 '18 at 14:24
  • @leftjoin It didn't work out well. I tried without bucketing: it took more minutes. I tried increasing the bucket count to 53: it also took more minutes, along with what you suggested above. Concatenation is also not giving me any performance benefit, and I hit some issues when I try to execute the ALTER with CONCATENATE. – Varshini Nov 30 '18 at 23:17
  • @Varshini have you succeeded in increasing the number of mappers running? – leftjoin Dec 01 '18 at 08:05
  • @leftjoin Yes, and it didn't improve the performance. I set the mapper grouping to the size mentioned above (32000000). – Varshini Dec 01 '18 at 13:19
  • @Varshini The figures in my answer are just an example; there is no universal configuration. Change them accordingly to increase parallelism; the number of mappers should increase. And when you did it, did the number of mappers running increase? – leftjoin Dec 01 '18 at 16:17