2

I have some questions which I faced recently in the interview with a company. As I am a newbie in Hadoop, can anyone please tell me the right answers?

Questions:

  1. Difference between "Sort By" and "Group by" in Hive. How they work?
  2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.
  3. How to optimize Hive Performance?
  4. Difference between "Internal Table" and "External Table"
  5. What is the main difference between Hive and SQL

Please provide me few useful resources, so that I can learn in the better way. Thanks

PradeepKumbhar
  • 3,361
  • 1
  • 18
  • 31
Ankur Kumar
  • 51
  • 1
  • 1
  • 7

5 Answers5

6

PFB the answers:

1. Difference between "Sort By" and "Group by" in Hive. How they work?

Ans. SORT BY sorts the data per reducer, it provides ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results. Whereas GROUP BY aggregate records by the specified columns which allows you to perform aggregation functions on non-grouped columns (such as SUM, COUNT, AVG, etc).

2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.

Ans. I think Reducer will work, because as per Hive documentation -- Limit indicates the number of rows to be returned. The rows returned are chosen at random. The following query returns 5 rows from t1 at random.

SELECT * FROM t1 LIMIT 5

Having to randomly pick, it has to have complete result output from Reducer.

- How to optimize Hive Performance?

Ans. These links should answer this

- Difference between "Internal Table" and "External Table"

Ans. "Internal Table" also known as Managed Table, is the one that is managed by Hive. When you point data in HDFS to such table, the data is moved to Hive default location /ust/hive/warehouse/. And, then if such internal table is dropped, the data is deleted along with.

"External table" on the other hand is user managed, and data is not moved to hive default directory after loading i.e, any custom location can be specified. Consecutively, when you drop such table, no data is deleted, only table schema is dropped.

- What is the main difference between Hive and SQL

Ans. Hive is a Datawarehousing layer on top of hadoop that provides SQL like row table interface to users for analyzing underlying data. It employs HiveQL (HQL) language for this which is loosely based on SQL-92 standards.

SQL is a standard RDBMS language for accessing and manipulating databases.

Shakti Tokas
  • 61
  • 1
  • 4
1

I am new to Hadoop and Hive as well so I can't give you a complete answer.

From what I've read in the book "Hadoop The Definitive Guide" the key difference between Hive and SQL is that Hive (HiveQL) was created with MapReduce in mind. Hive's SQL dialect is supposed to make it easier for people to interact with Hadoop without needing to know a lot about Java (and SQL is well known by data professionals anyway).

As time has went on, Hive has become more compliant to the SQL standard. It blends a mix of MySQL and Oracle's SQL dialects with SQL-92.

The Main Difference

From what I've read, the biggest difference is that RDBMS have schema's that are typically schema on write. This means that data needs to conform to the schema when you load it in the database. In Hive, it uses schema on read because it doesn't verify the data when it is loaded.

Information obtained from Hadoop The Definitive Guide

Really good book and gives a good overview of all the technologies involved.

EDIT:

For external and internal tables, check out this response:

Difference between Hive internal tables and external tables?

Information regarding Sort By and Group By

Sort By:

Hive uses the columns in SORT BY to sort the rows before feeding the rows to a reducer. The sort order will be dependent on the column types. If the column is of numeric type, then the sort order is also in numeric order. If the column is of string type, then the sort order will be lexicographical order.

Difference between Sort By and Order By

(Taken from the link provided maybe this will help with the difference between Group By and Sort By)

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified.

Group By:

Group By is done using aggregation. It is pretty much done the same as you would normally in any other SQL dialect.

    INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count (DISTINCT pv_users.userid)
    FROM pv_users
    GROUP BY pv_users.gender;

This query selects pv_users.gender and counts the distinct user_ids from the users table. In order to do count the users in a gender, you would first have to group all the users who are a certain gender together. (Query taken from the group by link below)

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+GroupBy

Information on Optimizing Hive Performance

http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/

Optimizing Joins

https://www.facebook.com/notes/facebook-engineering/join-optimization-in-apache-hive/470667928919/

General Hive Performance Tips

https://streever.atlassian.net/wiki/display/HADOOP/Hive+Performance+Tips

Some extra resources

SQL to Hive Cheat Sheet

http://hortonworks.com/wp-content/uploads/downloads/2013/08/Hortonworks.CheatSheet.SQLtoHive.pdf

Hive LIMIT Documentation

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause

Best of luck in your interview!

Community
  • 1
  • 1
K.Boyette
  • 343
  • 1
  • 11
1

From Hive 0.10.0 the simple select statement, such as select column_name from table name LIMIT n,can avoid map reduce if task conversation hive.fetch.task.conversion=more is set

deepak
  • 43
  • 3
1

1. Difference between "Sort By" and "Group by" in Hive. How they work?

  • SORT BY : It sorts the result within each reducers defined for the Map reduce job. It's not necessary that the output would be in a sorted order but the output coming from each reducer would be in order. Check example below! I ran it in 11 node cluster.

enter image description here

  • GROUP BY : It helps in aggregation of the data. sum() , count() , avg() , max() , min() , collect_list() , collect_set() all uses group by. It's like clubbing the result based on same features. Example : There is a state column and population column and we are aggregating on the basis of states , then there would be 29 distinct values with sum(population).

2. If we use the "Limit 1" in any SQL query in Hive, will Reducer work or not.

  • select * from db.table limit 1 : statement never includes reducers , you can check by using explain statement. enter image description here

  • select * from db.table order by column : uses reducers or whenever there is an aggregation. Check below screenshot.enter image description here

3. How to optimize Hive Performance?

  • Using Tez session

  • Using bucketing and Partitioning

  • Using Orc file format

  • Using vectorisation

  • Using CBO 4. Difference between "Internal Table" and "External Table"

  • Internal table : Both metadata and data stored in the hive. If one deletes the table, automatically entire schema and data would be deleted.

  • External table : Only metadata is handled by hive. Data is handled by user. If one deletes the table , only schema will be deleted, data remains intact. For creation of external table , one needs to use external keyword in create statement and also needs to specify the location where data is put.

5. What is the main difference between Hive and SQL

  • Hive is a data warehouse tool designed to process structured data on hadoop while SQL is used process structured data on RDBMS.
Tutu Kumari
  • 485
  • 4
  • 10
-2

Reducer will not run if we use limit in select clause.

select * from table_name limit 5;

sukesh
  • 1
  • 1