Hive: read table partitions defined in subselect

Question

I have a Hive table which is partitioned by partitionDate field. I can read partition of my choice via simple

select * from myTable where partitionDate = '2000-01-01'

My task is to specify the partition of my choise dynamically. I.e. first I want to read it from some table, and only then run select to myTable. And of course, I want the power of partitions to be used.

I have written a query which looks like

select * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate

The query works but looks like partitions are not used. The query works too long.

I tried another approach:

select * from myTable where partitionDate in (select reportDate from thatTable)

.. and again I see that the query works too slowly.

Is there a way to implement this in Hive?

update: create table for myTable

CREATE TABLE `myTable`(            
  `theDate` string,            
 ')            
PARTITIONED BY (           
  `partitionDate` string) 
TBLPROPERTIES (             
  'DO_NOT_UPDATE_STATS'='true',         
  'STATS_GENERATED_VIA_STATS_TASK'='true',                
  'spark.sql.create.version'='2.2 or prior',              
  'spark.sql.sources.schema.numPartCols'='1',    
  'spark.sql.sources.schema.numParts'='2',          
  'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"theDate","type":"string","nullable":true}...         
  'spark.sql.sources.schema.part.1'='{"name":"partitionDate","type":"string","nullable":true}...',               
  'spark.sql.sources.schema.partCol.0'='partitionDate')

can you please run `analyze table tab compute statistics;` and check? It should follow the partition. Unless, the data type of `partitionDate` is `timestamp/date` and the data type of your other table is a `string`. Can you also run `explain elect * from myTable mt join thatTable tt on tt.reportDate = mt.partitionDate` and check if its hitting partition? — Koushik Roy, Jul 21 '21 at 07:12
@KoushikRoy, I have executed analyze table myTable(partitionDate) compute statistics, but it gave no effect. I double-checked that column types are the same = String. I ran explain and see no evidence of partition used for query select * from myTable where partitionDate in (select reportDate from thatTable) — MiamiBeach, Jul 21 '21 at 08:13
cab you post `create table ` of mytable. wanted to see if this is partitioned properly. — Koushik Roy, Jul 21 '21 at 08:42
@KoushikRoy, updated the post. But see, simple query select * from myTable where partitionDate = '2000-01-01' works good, so looks like the issue is not with partitioning, but with the way I specify partition. — MiamiBeach, Jul 21 '21 at 09:20

leftjoin · Accepted Answer · 2021-07-21T20:15:45.817

2

If you are running Hive on Tez execution engine, try

set hive.tez.dynamic.partition.pruning=true;

Read more details and related configuration in the Jira HIVE-7826

and at the same time try to rewrite as a LEFT SEMI JOIN:

select * 
  from myTable t 
       left semi join (select distinct reportDate from thatTable) s on t.partitionDate = s.reportDate

If nothing helps, see this workaround: https://stackoverflow.com/a/56963448/2700344

Or this one: https://stackoverflow.com/a/53279839/2700344

Similar question: Hive Query is going for full table scan when filtering on the partitions from the results of subquery/joins

edited Jul 21 '21 at 20:15

answered Jul 21 '21 at 12:00

leftjoin

36,950
8
57
116

I am running on MR, not on TEZ. Also, how can I write results of my query into variable? As far as I understand it is not possible, at least not in simple Hive QL. – MiamiBeach Jul 21 '21 at 13:55
@MiamiBeach Only possible way is like in workaround links provided. Not inside the same script. – leftjoin Jul 21 '21 at 14:06

Hive: read table partitions defined in subselect

1 Answers1

Linked