2

I have a Hive table which is partitioned by partitionDate field. I can read partition of my choice via simple

select * from myTable where partitionDate = '2000-01-01'

My task is to specify the partition of my choise dynamically. I.e. first I want to read it from some table, and only then run select to myTable. And of course, I want the power of partitions to be used.

I have written a query which looks like

select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate

The query works but looks like partitions are not used. The query works too long.

I tried another approach:

select * from myTable where partitionDate in (select reportDate from thatTable)

.. and again I see that the query works too slowly.

Is there a way to implement this in Hive?

update: create table for myTable

CREATE TABLE `myTable`(            
  `theDate` string,            
 ')            
PARTITIONED BY (           
  `partitionDate` string) 
TBLPROPERTIES (             
  'DO_NOT_UPDATE_STATS'='true',         
  'STATS_GENERATED_VIA_STATS_TASK'='true',                
  'spark.sql.create.version'='2.2 or prior',              
  'spark.sql.sources.schema.numPartCols'='1',    
  'spark.sql.sources.schema.numParts'='2',          
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...         
  'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',               
  'spark.sql.sources.schema.partCol.0'='partitionDate')  
leftjoin
  • 36,950
  • 8
  • 57
  • 116
MiamiBeach
  • 3,261
  • 6
  • 28
  • 54
  • can you please run `analyze table tab compute statistics;` and check? It should follow the partition. Unless, the data type of `partitionDate` is `timestamp/date` and the data type of your other table is a `string`. Can you also run `explain elect * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate` and check if its hitting partition? – Koushik Roy Jul 21 '21 at 07:12
  • @KoushikRoy, I have executed analyze table myTable(partitionDate) compute statistics, but it gave no effect. I double-checked that column types are the same = String. I ran explain and see no evidence of partition used for query select * from myTable where partitionDate in (select reportDate from thatTable) – MiamiBeach Jul 21 '21 at 08:13
  • cab you post `create table ` of mytable. wanted to see if this is partitioned properly. – Koushik Roy Jul 21 '21 at 08:42
  • @KoushikRoy, updated the post. But see, simple query select * from myTable where partitionDate = '2000-01-01' works good, so looks like the issue is not with partitioning, but with the way I specify partition. – MiamiBeach Jul 21 '21 at 09:20
  • or may be gather stats... – Koushik Roy Jul 21 '21 at 11:55

1 Answers1

2

If you are running Hive on Tez execution engine, try

set hive.tez.dynamic.partition.pruning=true;

Read more details and related configuration in the Jira HIVE-7826

and at the same time try to rewrite as a LEFT SEMI JOIN:

select * 
  from myTable t 
       left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate 

If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344

Or this one: https://stackoverflow.com/a/53279839/2700344

Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • I am running on MR, not on TEZ. Also, how can I write results of my query into variable? As far as I understand it is not possible, at least not in simple Hive QL. – MiamiBeach Jul 21 '21 at 13:55
  • @MiamiBeach Only possible way is like in workaround links provided. Not inside the same script. – leftjoin Jul 21 '21 at 14:06