
When the PARTITIONED BY or CLUSTERED BY keywords are used while creating a Hive table, Hive creates separate subdirectories (one per partition) or files (one per bucket). But is this still valid for external tables? My understanding is that the data files backing external tables are not managed by Hive. So does Hive still create additional directories or files corresponding to each partition or bucket, and move the corresponding data into them?
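(By bucketing I mean a table declared along these lines; the schema is made up purely for illustration:)

CREATE TABLE users (id BIGINT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;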

Edit - Adding details.
A few extracts from "Hadoop: The Definitive Guide", Chapter 17 (Hive):
CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);

When we load data into a partitioned table, the partition values are specified explicitly:

LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');

At the filesystem level, partitions are simply nested subdirectories of the table directory. After loading a few more files into the logs table, the directory structure might look like this:

/user/hive/warehouse/logs
├── dt=2001-01-01/
│   ├── country=GB/
│   │   ├── file1
│   │   └── file2
│   └── country=US/
│       └── file3
└── dt=2001-01-02/
    ├── country=GB/
    │   └── file4
    └── country=US/
        ├── file5
        └── file6

The table above was obviously a managed table, so Hive had ownership of the data and created a directory for each partition, as in the tree structure above.

In the case of an external table:
CREATE EXTERNAL TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING);
followed by the same load operation:
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' INTO TABLE logs PARTITION (dt='2001-01-01', country='GB');

How will Hive handle these partitions? For an external table without partitions, Hive simply points to the data file and answers any query by parsing it. But when data is loaded into a partitioned external table, where are the partitions created?

Hopefully in the Hive warehouse? Can someone confirm or clarify this?

user4157124

2 Answers


Suppose you partition on date, as this is a common thing to do.

CREATE EXTERNAL TABLE mydatabase.mytable (
    var1   DOUBLE
  , var2   INT
)
-- the partition column is declared only in PARTITIONED BY;
-- repeating it in the column list above would be an error
PARTITIONED BY (`date` STRING)
LOCATION '/user/location/wanted/';

Then add all your partitions:

ALTER TABLE mytable ADD PARTITION (`date` = '2017-07-27');
ALTER TABLE mytable ADD PARTITION (`date` = '2017-07-28');

So on and so forth.
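If there are many partitions, note that Hive also accepts several partition specs in one statement (a sketch using the same table as above):

ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (`date` = '2017-07-27')
  PARTITION (`date` = '2017-07-28');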

Finally, you can put your data into the proper partition directories. You will have an external partitioned table.
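If the files for a day already sit somewhere outside the table directory, a partition can also be pointed at that location explicitly (the path below is made up for illustration):

ALTER TABLE mytable ADD PARTITION (`date` = '2017-07-29')
LOCATION '/data/ingest/2017-07-29/';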

invoketheshell
  • My understanding was that Hive does not take ownership of the data files referred to by external tables. So which process partitions the data file? (In the case of managed tables, Hive partitions the data while loading the data file and saves it in its warehouse directory.) Could you explain the process in a bit more detail? Sorry, but I only get half of the picture. –  Jun 27 '17 at 20:53
  • You can create an external table that is partitioned. Using ALTER TABLE ... ADD PARTITION creates the directory structure in HDFS and registers the partitions with Hive. Then you can -put your data into the proper location. Often there are automated processes that place batch data into a partition every so often during the day. Then, if someone decides to drop the table, the data won't be lost. You can build a partitioned table as shown above. – invoketheshell Jun 27 '17 at 20:58
  • Maybe the question needed a bit more explanation, so I added it. –  Jun 29 '17 at 13:43
  • It's better to run "msck repair table tablename" in Hive if possible, since it adds all the missing partitions automatically – blade Dec 08 '19 at 21:29

There is an easy way to do this. Create your external Hive table first.

CREATE EXTERNAL TABLE database.table (
    id   INT,
    name STRING
)
PARTITIONED BY (country STRING)
LOCATION 'xxxx';

Next, you have to run an MSCK (metastore consistency check) command:

MSCK REPAIR TABLE database.table;

This command will recover all partitions that exist under your table's location (as long as the subdirectories follow the country=value naming convention) and register them in the metastore. Now, if you run a query against your table, data from all partitions will be retrieved.
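As a quick sketch of the whole sequence (the GB value is made up for illustration):

MSCK REPAIR TABLE database.table;

-- verify that the partitions were picked up
SHOW PARTITIONS database.table;

-- queries can now prune by partition
SELECT * FROM database.table WHERE country = 'GB' LIMIT 10;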

Tharun Kumar