Questions tagged [hive]

Apache Hive is a database built on top of Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible distributed file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. Please DO NOT use this tag for the Flutter database that is also named Hive; use the flutter-hive tag instead.

Apache Hive is a database built on top of Hadoop that provides the following:

  • Tools to enable easy data extract/transform/load (ETL) and summarization
  • Ad-hoc querying and analysis of large datasets stored in the Hadoop file system (HDFS)
  • A mechanism to put structure on this data
  • An advanced query language called Hive Query Language (HiveQL), which is based on SQL with some additional features such as DISTRIBUTE BY and TRANSFORM, and which enables users familiar with SQL to query this data.

At the same time, this language also lets traditional map/reduce programmers plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.
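
As a rough illustration of plugging in a custom mapper, here is a minimal sketch of a streaming script that could be used with TRANSFORM. It assumes Hive's default behaviour of piping rows to the script as tab-separated text on stdin and reading tab-separated rows back from stdout. The script name `normalize_host.py`, the table `page_views`, and the columns `user_id`, `url` and `host` are all hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical streaming script for Hive's TRANSFORM clause.

A query might call it like this (table and column names are made up):

    ADD FILE normalize_host.py;
    SELECT TRANSFORM (user_id, url)
           USING 'python3 normalize_host.py'
           AS (user_id, host)
    FROM page_views;
"""
import sys
from urllib.parse import urlparse


def transform_line(line):
    """Turn one 'user_id<TAB>url' row into 'user_id<TAB>host'."""
    user_id, url = line.rstrip("\n").split("\t", 1)
    # An empty host is emitted as \N, the default NULL marker in Hive text rows.
    host = urlparse(url).netloc.lower() or "\\N"
    return f"{user_id}\t{host}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hive streams one row per line; blank lines are skipped.
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

The per-row logic lives in `transform_line` so it can be checked locally, for example with `printf '42\thttp://Example.COM/a\n' | python3 normalize_host.py`, before the script is shipped to the cluster.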

Since Hive is Hadoop-based, it does not and cannot promise low latencies on queries. The paradigm is strictly one of submitting jobs and being notified when they complete, as opposed to real-time queries. In systems such as Oracle, analysis runs on a significantly smaller amount of data but proceeds much more iteratively, with response times between iterations of less than a few minutes. In contrast, Hive query response times can be on the order of several minutes even for the smallest jobs, and larger jobs (e.g., jobs processing terabytes of data) may run for hours or days. Many optimizations and improvements have been made to speed up processing, such as fetch-only tasks, LLAP, and materialized views.

To summarize, while low-latency performance is not the top priority of Hive's design principles, the following are Hive's key features:

  • Scalability (scale out with more machines added dynamically to the Hadoop cluster)
  • Extensibility (with the map/reduce framework and UDF/UDAF/UDTF)
  • Fault-tolerance
  • Loose-coupling with its input formats
  • A rather rich query language with native support for JSON, XML, and regexp; the ability to call Java methods and to use Python and shell transformations; analytics and windowing functions; and the ability to connect to different RDBMSs using JDBC drivers and to Kafka using the Kafka connector.
  • Ability to read and write almost any file format using native and third-party SerDes, such as RegexSerDe.
  • Numerous third-party extensions, for example Brickhouse UDFs.

How to write a good Hive question:

  1. Add a clear textual description of the problem.
  2. Provide the query and/or table DDL if applicable.
  3. Provide the exception message.
  4. Provide an example of the input data and the desired output.
  5. Questions about query performance should include the EXPLAIN output for the query.
  6. Do not use pictures for SQL, DDL, DML, data examples, EXPLAIN output, or exception messages.
  7. Use proper code and text formatting.

21846 questions
258 votes, 19 answers

Difference between Pig and Hive? Why have both?

My background - 4 weeks old in the Hadoop world. Dabbled a bit in Hive, Pig and Hadoop using Cloudera's Hadoop VM. Have read Google's paper on Map-Reduce and GFS (PDF link). I understand that- Pig's language Pig Latin is a shift from(suits the way…
Arnkrishn
  • 29,828
  • 40
  • 114
  • 128
202 votes, 17 answers

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop or HBase or Hive? From my understanding, HBase avoids using map-reduce and has column-oriented storage on top of HDFS. Hive is a SQL-like interface for Hadoop and HBase. I would also like to know how…
Khalefa
  • 2,294
  • 3
  • 14
  • 12
155 votes, 9 answers

What is the difference between partitioning and bucketing a table in Hive?

I know both are performed on a column in the table, but how is each operation different?
NishM
  • 1,706
  • 2
  • 15
  • 26
124 votes, 19 answers

Difference between Hive internal tables and external tables?

Can anyone tell me the difference between Hive's external tables and internal tables? I know the difference comes when dropping the table. I don't understand what you mean by the data and metadata are deleted in internal and only metadata is deleted…
NJ_315
  • 1,863
  • 7
  • 22
  • 30
123 votes, 6 answers

Difference between INNER JOIN and LEFT SEMI JOIN

What is the difference between an INNER JOIN and LEFT SEMI JOIN? In the scenario below, why am I getting two different results? The INNER JOIN result set is a lot larger. Can someone explain? I am trying to get the names within table_1 that only…
user3023355
  • 1,257
  • 2
  • 9
  • 6
115 votes, 10 answers

How to set variables in HIVE scripts

I'm looking for the SQL equivalent of SET varname = value in Hive QL. I know I can do something like this: SET CURRENT_DATE = '2012-09-16'; SELECT * FROM foo WHERE day >= @CURRENT_DATE But then I get this error: character '@' not supported here
user1678312
  • 1,309
  • 3
  • 10
  • 11
111 votes, 4 answers

How to get/generate the create statement for an existing hive table?

Assuming you have "table" already in Hive, is there a quick way like other databases to be able to get the "CREATE" statement for that table?
Rolando
  • 58,640
  • 98
  • 266
  • 407
105 votes, 11 answers

How to save DataFrame directly to Hive?

Is it possible to save a DataFrame in Spark directly to Hive? I have tried converting the DataFrame to an RDD, saving it as a text file, and then loading it into Hive. But I am wondering if I can save the DataFrame to Hive directly.
Gourav
  • 1,245
  • 2
  • 10
  • 12
102 votes, 25 answers

How to know Hive and Hadoop versions from command prompt?

How can I find which Hive version I am using from the command prompt? Below are the details: I am using Putty to connect to a Hive table and access records in the tables. So what I did is: I opened Putty and in the host name I typed-…
arsenal
  • 23,366
  • 85
  • 225
  • 331
95 votes, 6 answers

Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy. I have read many documents that state Parquet to be better in…
Rahul
  • 2,354
  • 3
  • 21
  • 30
89 votes, 6 answers

How to Update/Drop a Hive Partition?

After adding a partition to an external table in Hive, how can I update/drop it?
darcyy
  • 5,236
  • 5
  • 28
  • 41
85 votes, 18 answers

How do I output the results of a HiveQL query to CSV?

We would like to put the results of a Hive query to a CSV file. I thought the command should look like this: insert overwrite directory '/home/output.csv' select books from table; When I run it, it says it completed successfully but I can never…
AAA
  • 2,388
  • 9
  • 32
  • 47
79 votes, 7 answers

Is there any way to get the column names along with the output when executing a query in Hive?

In Hive, when we do a query (like: select * from employee), we do not get any column names in the output (like name, age, salary that we would get in RDBMS SQL), we only get the values. Is there any way to get the column names to be displayed along…
Nithin
  • 9,661
  • 14
  • 44
  • 67
79 votes, 6 answers

How to select current date in Hive SQL

How do we get the current system date in Hive? In MySQL we have select now(); can anyone please help me get the query results? I am very new to Hive. Is there proper documentation for Hive that gives detailed information about the pseudo…
Elingela
  • 819
  • 1
  • 6
  • 4
79 votes, 11 answers

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask While trying to make a copy of a partitioned table using the commands in the hive console: CREATE TABLE copy_table_name LIKE table_name; INSERT…
nickponline
  • 25,354
  • 32
  • 99
  • 167