Partitioning:
Partitioning divides a table's data into separate directories based on the values of one or more columns, here the date (dt) and the country.
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
LOAD DATA LOCAL INPATH 'input/hive/partitions/file1'
INTO TABLE logs PARTITION (dt='2012-01-01', country='GB');
After more files have been loaded into other partitions in the same way, the warehouse directory looks like this:
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file1
/user/hive/warehouse/logs/dt=2012-01-01/country=GB/file2
/user/hive/warehouse/logs/dt=2012-01-01/country=US/file3
/user/hive/warehouse/logs/dt=2012-01-02/country=GB/file4
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file5
/user/hive/warehouse/logs/dt=2012-01-02/country=US/file6
SELECT ts, dt, line
FROM logs
WHERE country='GB';
This query will only scan file1, file2 and file4, because Hive reads only the directories whose country value is GB and skips the rest.
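A short sketch of how pruning can be checked and narrowed further. It assumes the logs table and the six files listed above; the second query is an illustrative addition and not part of the original example.

```sql
-- List the partitions Hive has registered for the table.
SHOW PARTITIONS logs;

-- Filtering on both partition columns narrows the scan to one directory
-- (dt=2012-01-02/country=US), which holds file5 and file6 in the layout above.
SELECT ts, line
FROM logs
WHERE dt = '2012-01-02' AND country = 'US';
```

The partition columns dt and country are not stored inside the data files. Hive takes their values from the directory names, which is why a filter on them can skip whole directories.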
Bucketing:
Bucketing further divides the data of a table (or of each partition) into a fixed number of files, called buckets, based on the hash of one or more columns.
There are two reasons why we might want to organize our tables (or partitions) into buckets.
The first is to enable more efficient queries. Bucketing imposes extra structure on the table, which Hive can take advantage of when performing certain queries. In particular, a join of two tables that are bucketed on the same columns – which include the join columns – can be efficiently implemented as a map-side join.
The second reason to bucket a table is to make sampling more efficient. When working with large datasets, it is very convenient to try out queries on a fraction of your dataset while you are in the process of developing or refining them.
Let’s see how to tell Hive that a table should be bucketed. We use the CLUSTERED BY clause to specify the columns to bucket on and the number of buckets:
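CREATE TABLE student (rollNo INT, name STRING)
CLUSTERED BY (rollNo) INTO 4 BUCKETS;
To sample one of the four buckets, sample on the bucketing column so that Hive reads only that bucket (ON rand() also works, but it scans the whole table):
SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 4 ON rollNo);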
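The join case mentioned earlier could look roughly like the sketch below. It assumes student is bucketed on rollNo into 4 buckets; the tables marks and marks_staging are hypothetical names introduced only for illustration. Whether Hive actually chooses a bucket map join depends on the Hive version and its settings, so it is worth checking the plan with EXPLAIN.

```sql
-- Hypothetical second table, bucketed on the join column
-- into the same number of buckets as student.
CREATE TABLE marks (rollNo INT, score INT)
CLUSTERED BY (rollNo) INTO 4 BUCKETS;

-- Needed on Hive releases before 2.0 so that inserts really write 4 bucket files;
-- later releases always enforce bucketing.
SET hive.enforce.bucketing = true;

-- Populate the bucketed table from a hypothetical unbucketed staging table.
INSERT OVERWRITE TABLE marks
SELECT rollNo, score FROM marks_staging;

-- Let Hive consider a bucket map join for tables bucketed on the join key.
SET hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(m) */ s.rollNo, s.name, m.score
FROM student s
JOIN marks m ON s.rollNo = m.rollNo;
```

Bucketed tables are filled with INSERT ... SELECT rather than LOAD DATA, because LOAD DATA only moves files into place and does not distribute rows into buckets.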