Short answer: Hive is not data storage itself; it queries data that lives in storage through tables (the schema definition, the SerDe used for serialization/deserialization, and the data location are all defined in the CREATE TABLE statement).
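For example, a minimal external table over CSV files might look like this (the table name, columns and path are made up for illustration):

```sql
-- Hypothetical example: an external table over CSV files already sitting in HDFS.
-- Hive stores only the metadata; the files themselves stay where they are.
CREATE EXTERNAL TABLE sales_raw (
  order_id   BIGINT,
  product    STRING,
  amount     DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/landing/sales';   -- just a directory of files in HDFS
```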
Long answer:
Data is stored in HDFS or another Hadoop-compatible filesystem such as S3 (which can be completely separate from the Hadoop cluster).
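As a sketch (the bucket name is made up), the hypothetical table above could be repointed at object storage without touching the files:

```sql
-- Hive only updates the metadata pointer; it does not move or copy any files.
ALTER TABLE sales_raw SET LOCATION 's3a://my-bucket/landing/sales';
```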
Hive is a database: it has rich SQL (DDL and DML), metadata that includes statistics, table definitions and access grants, a cost-based optimizer, and it can use different query engines: MR (MapReduce) and Tez. The difference between Hive and a traditional RDBMS is that Hive uses the schema-on-read concept: how data is stored and how it is read are completely decoupled; the schema is applied when the data is read, and data files can be added to HDFS by some external process.
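For instance, the engine choice and the statistics used by the cost-based optimizer are controlled with ordinary Hive commands (shown here against the hypothetical sales_raw table):

```sql
-- Choose the execution engine for this session (mr or tez).
SET hive.execution.engine=tez;

-- Gather table- and column-level statistics for the cost-based optimizer.
ANALYZE TABLE sales_raw COMPUTE STATISTICS;
ANALYZE TABLE sales_raw COMPUTE STATISTICS FOR COLUMNS;
```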
Hive can read various structured file formats (JSON, Avro, CSV, Parquet, ORC, etc.) as well as semi-structured files (using RegexSerDe or any other SerDe, even a custom one). Hive can also connect to other JDBC sources for easy integration and read from/write to them.
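As an illustration of a non-default SerDe (the log layout and regex are made up), semi-structured text can be parsed at read time like this:

```sql
-- Hypothetical: parse log lines with a regex applied on read; RegexSerDe columns must be STRING.
CREATE EXTERNAL TABLE access_log (
  ip      STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) \\[([^\\]]+)\\] (.*)"  -- one capture group per column
)
STORED AS TEXTFILE
LOCATION '/data/logs/access';
```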
In Hive, a table or partition is a location in HDFS where data files are stored, plus metadata containing the schema definition, SerDe, statistics and access grants.
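You can inspect both parts for any table, e.g. the hypothetical sales_raw from above:

```sql
-- Shows the location, SerDe, storage format and statistics held in the metastore.
DESCRIBE FORMATTED sales_raw;
-- For a single partition of a partitioned table:
-- DESCRIBE FORMATTED some_table PARTITION (load_date='2020-01-01');
```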
You can create a table on top of an existing location, and even many tables (with different schemas) on top of the same location. Read this answer about multiple tables on top of the same location, and this answer about managed/external tables: https://stackoverflow.com/a/54242477/2700344.
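A sketch of two tables over the same directory, each applying its own schema on read (names and columns are hypothetical):

```sql
-- Same files, read as one raw STRING column per line (default SerDe).
CREATE EXTERNAL TABLE sales_raw_text (line STRING)
LOCATION '/data/landing/sales';

-- Same files again, this time parsed into typed CSV columns.
CREATE EXTERNAL TABLE sales_raw_csv (
  order_id BIGINT,
  product  STRING,
  amount   DECIMAL(10,2)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/landing/sales';
```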
You can put files directly into the table location or remove them using HDFS commands, and this will be reflected in the dataset returned by Hive. The LOAD DATA ... INTO TABLE command is also supported; it puts the files into the table location for you, so you do not need to know the location path.
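A minimal sketch of LOAD DATA (the paths are made up); it simply moves or uploads the files into the table's directory:

```sql
-- Move a file that is already in HDFS into the table's location.
LOAD DATA INPATH '/staging/sales_2020-01-01.csv' INTO TABLE sales_raw;

-- Or upload a local file from the machine running the Hive client.
LOAD DATA LOCAL INPATH '/tmp/sales_2020-01-02.csv' INTO TABLE sales_raw;
```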