1

There are some good questions/answers here

Hive clustered by on more than one column

hive subquery optimization using cluster by

difference between Cluster By and CLUSTERED BY in hive?

What is the difference between partitioning and bucketing a table in Hive ?

but I have a few more, unfortunately there is no good explanation here on page 24: https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/using-hiveql/hive_using_hiveql.pdf

My questions:

In below example from the above:

CREATE TABLE pageviews (userid VARCHAR(64), link STRING, from STRING)
 PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS;

 INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23') VALUES
  ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);

  INSERT INTO TABLE pageviews PARTITION (datestamp) VALUES ('tjohnson',
  'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null,
  '2014-09-21'); 

why does "datestamp STRING" do not exist in the the schema of the pageviews?

Why is it defined as string? should not be TIMESTAMP?

Why does the second insert miss it and only has it as type but it has as values (i.e. '2014-09-23' and '2014-09-21?

Vadim Kotov
  • 8,084
  • 8
  • 48
  • 62
Mohsen Sichani
  • 1,002
  • 12
  • 33

1 Answers1

1

why does "datestamp STRING" do not exist in the the schema of the pageviews?

Although datestamp looks and behaves like a standard column defined in the schema, it's actually just a reference to a particular partition of the underlying data for the table. When you see '2014-09-23' in the datestamp column, it's not actually showing you a value contained in a particular record in one of the data files, instead it's telling you that the data in the rest of the row comes from an HDFS directory called 'datestamp=2014-09-23' that contains a partition or "chunk" of the data. This is were a lot of the optimization comes in, since filtering a query to a particular partition allows Hive to simply go to the data in that particular directory and ignore the data contained in the other n number of partitions.

Why is it defined as string? should be TIMESTAMP?

Since a partition is simply referring to a directory name, it only makes sense that the type is a string representation of a specific date format instead of a timestamp or date. Conceptually, a date field would not make sense since although '2014-09-23' and '9/23/2014' are two equal datestamps, these would be considered different directories if they were directory names. In other words, if a directory is named '2014-09-23', you cannot refer to it by any other name making it more like a string and less like a date which has many alternate forms that are all equivalent. Furthermore, Hive already treats dates as strings which makes it a better solution than say, a type of int. For example if you pass in a timestamp to Hive's to_date() user defined function, it returns the date as a string.

Also, since you mentioned timestamp, using a full timestamp that has fractions of a second in it is a bad idea for partitions, even if you use a string representation of it. You would end up with a massive amount of partitions and probably one or at most only a few records in each partition. I would imagine you would quickly lose any of the performance benefits of partitioning.

Why does the second insert miss it and only has it as type but it has as values (i.e. '2014-09-23' and '2014-09-21?

This is simply a different syntax that produces the same result. When you include partitions, Hive will assume the values at the end of the values array refer to the partitions. So if you have a table with 3 columns in your schema and 1 partition, when you perform an insert into table command and specify partition (datestamp), you can just pass in 4 values and Hive will know that the first 3 values are to be inserted into the 3 columns in your schema, and the fourth value refers to which datestamp partition you want to add this record's data to.

samredai
  • 662
  • 3
  • 8
  • Nice explanation, just a question, regarding the third point, if partitioning is done by two columns (say a,b), so I can practically has (att1, att2, att3, a,b)? should it be always be like this? I guess none of these two are possible (att1, att2, att3, a)? or (att1, att2, att3, b) – Mohsen Sichani Aug 29 '19 at 18:56
  • 1
    @mohsenhs Great question. When you have multiple partitions, the partitioning (and the directories) will be hierarchical, based on order (a,b). In terms of behavior, it helps to keep the directory structure in mind. In your example, the **b** partitions will be sub-folders of the **a** partitions, i.e. __a='part1'/b='subpart1'__ & __a='part1'/b='subpart2', so if you are inserting data, you can see that you would need to include both partitions since it needs to know which **a** partition folder to go into, and then which **b** subpartition folder within that **a** folder to place the record. – samredai Aug 29 '19 at 19:53