1

When I write the hive query like below

select count(*)
from order
where order_month >= '2016-11';

Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1

I am getting 5 mappers only it means reading required partitions only(2016-11 and 2016-12)

Same query I write using function

select count(*)
from order
where order_month >= concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)));

Note:

concat(year(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10)),'-',month(DATE_SUB(to_date(from_unixtime(UNIX_TIMESTAMP())),10))) = '2016-11'

Hadoop job information for Stage-1: number of mappers: 216; number of reducers: 1

this time it is reading all partitions {i.e. 2004-10 to 2016-12}. .

How to modify the query to read required partitions only.

Iulian Dogariu
  • 510
  • 4
  • 10
Kranthi
  • 11
  • 1
  • 1
    Possible duplicate of [Hive partition pruning on computed column](http://stackoverflow.com/questions/35782664/hive-partition-pruning-on-computed-column) – Iulian Dogariu Dec 10 '16 at 20:43
  • See also this question http://stackoverflow.com/questions/33369488/hive-query-runs-very-slowly-when-using-variables-in-where-condition/33384954 – Iulian Dogariu Dec 10 '16 at 20:46

1 Answers1

0

unix_timestamp() function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP and CURRENT_DATE.

Use current_date, also no need to calculate year and month separately:

where order_month >= substr(date_sub(current_date, 10),1,7)
leftjoin
  • 36,950
  • 8
  • 57
  • 116