
I have data saved as Parquet files in Azure Blob storage. The data is partitioned by year, month, day and hour, like:

cont/data/year=2017/month=02/day=01/

I want to create an external table in Hive using the following CREATE statement, which I wrote using this reference.

CREATE EXTERNAL TABLE table_name (uid string, title string, value string) 
PARTITIONED BY (year int, month int, day int) STORED AS PARQUET 
LOCATION 'wasb://cont@storage_name.blob.core.windows.net/data';

This creates the table, but it has no rows when queried. I tried the same CREATE statement without the PARTITIONED BY clause and that seems to work, so it looks like the issue is with partitioning.

What am I missing?

chhantyal

1 Answer


After you create the partitioned table, run the following in order to add the directories as partitions:

MSCK REPAIR TABLE table_name;
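
To check that the directories were actually registered, you can list the partitions afterwards (same table name as in the question):

SHOW PARTITIONS table_name;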

If you have a large number of partitions, you might need to set hive.msck.repair.batch.size:

When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch-wise to avoid an OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in batches internally. The default value of the property is zero, which means it will process all the partitions at once.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE)
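
For example, something along these lines (the batch size of 3000 is only an illustrative value, pick one that fits your cluster):

SET hive.msck.repair.batch.size=3000;
MSCK REPAIR TABLE table_name;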


Written by the OP:

This will probably fix your issue; however, if the data is very large, it won't work. See the relevant issue here.

As a workaround, there is another way to add partitions to the Hive metastore one by one, like:

ALTER TABLE table_name ADD PARTITION (year=2016, month=10, day=11, hour=11);

We wrote a simple script to automate this ALTER statement, and it seems to work for now.
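
For reference, a minimal sketch of the kind of statement such a script can emit — Hive accepts several partition specs in a single ALTER TABLE ... ADD PARTITION, and IF NOT EXISTS makes it safe to re-run (the partition values below are only examples):

ALTER TABLE table_name ADD IF NOT EXISTS
PARTITION (year=2016, month=10, day=11, hour=11)
PARTITION (year=2016, month=10, day=11, hour=12);

When the directories follow the key=value naming convention shown in the question, Hive derives each partition's location under the table's LOCATION automatically; otherwise add a LOCATION clause per partition.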

David דודו Markovitz
  • Thanks for the answer. I just found that statement in some docs. However, I am getting this error when running it: `FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask`. So it looks like there is some other issue. – chhantyal Apr 11 '17 at 16:13
  • @chhantyal It might be caused by the metastore schema verification, please refer to https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool#HiveSchemaTool-MetastoreSchemaVerification to check whether the property `hive.metastore.schema.verification` is `true` in `hive-site.xml`. Or it might be caused by SQL Azure locking or denying some operations. – Peter Pan Apr 12 '17 at 08:55
  • @chhantyal - not enough information. Check logs or run in DEBUG mode. – David דודו Markovitz Apr 12 '17 at 09:03
  • Try changing the logging to DEBUG – David דודו Markovitz Apr 12 '17 at 13:16
  • Have you tried setting hive.msck.repair.batch.size? – David דודו Markovitz Apr 19 '17 at 15:47
  • Setting `hive.msck.repair.batch.size` is only available in a newer version (2.2.0) of Hive, which HDInsight doesn't support yet. Even then there seems to be some issue, which I linked in the updated answer. – chhantyal Apr 24 '17 at 07:59