
At my firm, I see these two commands used frequently, and I'd like to be aware of the differences, because their functionality seems the same to me:

1

create table <mytable> 
(name string,
number double);

load data inpath '/directory-path/file.csv' into table <mytable>;

2

create table <mytable>
(name string,
number double)
location '/directory-path/file.csv';

They both copy the data from the directory on HDFS into the directory for the table on HIVE. Are there differences that one should be aware of when using these? Thank you.

makansij
  • I believe for your second query the `location '/directory-path/file.csv';` will not work since you are creating an internal table (since you did not explicitly specify `create external table` and simply said `create table (name string, number double);`). Therefore, you cannot use location on that internal table since the location for internal tables is already set by the configuration property `hive.metastore.warehouse.dir` and you cannot change it using the `location` keyword. –  Jun 10 '18 at 07:08

2 Answers


Yes, they are used for entirely different purposes.

The load data inpath command is used to load data into a Hive table. 'LOCAL' signifies that the input file is on the local file system. If 'LOCAL' is omitted, Hive looks for the file in HDFS.

load data inpath '/directory-path/file.csv' into table <mytable>;
load data local inpath '/local-directory-path/file.csv' into table <mytable>;
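As a rough sketch of how the two forms behave, the statements below load into a hypothetical table named `mytable`. The comma delimiter and the file `other.csv` are assumptions made for illustration, not something the question specifies.

```sql
-- Hypothetical table; the comma delimiter is an assumption about the CSV file.
CREATE TABLE mytable (name STRING, number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Without LOCAL: the path is resolved on HDFS and the file is moved
-- into the table's directory.
LOAD DATA INPATH '/directory-path/file.csv' INTO TABLE mytable;

-- With LOCAL: the path is on the local file system and the file is copied,
-- so the original stays where it was.
LOAD DATA LOCAL INPATH '/local-directory-path/file.csv' INTO TABLE mytable;

-- OVERWRITE replaces the table's existing contents instead of appending.
-- other.csv is a hypothetical second file.
LOAD DATA INPATH '/directory-path/other.csv' OVERWRITE INTO TABLE mytable;
```

The path may also name a directory, in which case all files in that directory are loaded.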

The LOCATION keyword lets a table point to any HDFS location for its storage, rather than a folder under the directory specified by the configuration property hive.metastore.warehouse.dir.

In other words, with specified LOCATION '/your-path/', Hive does not use a default location for this table. This comes in handy if you already have data generated.
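A minimal sketch of that case, assuming CSV files already sit in the HDFS directory `/directory-path/`. The table name `mytable_ext` and the comma delimiter are hypothetical. As far as I know, LOCATION is expected to name a directory rather than a single file.

```sql
-- Assumes the data files already exist under /directory-path/ on HDFS.
CREATE EXTERNAL TABLE mytable_ext (name STRING, number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/directory-path/';

-- No LOAD DATA step is needed; queries read the files in place.
SELECT * FROM mytable_ext LIMIT 10;

-- Reports the table type and the location the table points at.
DESCRIBE FORMATTED mytable_ext;
```

Every file in the directory is treated as part of the table, so the files should all share the same layout.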

Remember, LOCATION is typically used with EXTERNAL tables. For regular tables created without a LOCATION clause, the default location will be used.

To summarize, load data inpath tells Hive where to look for input files, and the LOCATION keyword tells Hive where the table's data files are stored on HDFS.

References: https://cwiki.apache.org/confluence/display/Hive/GettingStarted https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

Ahmadov
Sachin Gaikwad
  • Is it possible to pass a directory to the `load data inpath` command? I've noticed that you pass file names. – makansij Feb 20 '16 at 05:27
  • The `TABLE` keyword is missing in both commands you provided. They should be something like `load data inpath '/directory-path/file.csv' into TABLE <mytable>` – JenkinsY Aug 29 '19 at 06:34

Option 1: Internal table

create table <mytable> 
(name string,
number double);

load data inpath '/directory-path/file.csv' into table <mytable>;

These commands create an internal (managed) table and move the file out of the source directory into the table's own directory.
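One way to see the move for yourself is to list the directories before and after the load with the `dfs` command of the Hive CLI. This is only a sketch: the table name `mytable` is hypothetical, and the warehouse path assumes the default `hive.metastore.warehouse.dir` of `/user/hive/warehouse` and the default database.

```sql
-- Hypothetical managed table; the comma delimiter is an assumption.
CREATE TABLE mytable (name STRING, number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Before the load, file.csv is listed in the source directory.
dfs -ls /directory-path/;

LOAD DATA INPATH '/directory-path/file.csv' INTO TABLE mytable;

-- After the load, file.csv no longer appears in the source directory.
dfs -ls /directory-path/;

-- It now sits under the table's directory (assumed default warehouse path).
dfs -ls /user/hive/warehouse/mytable/;
```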

Option 2: External table

 create table <mytable>
 (name string,
 number double)
 location '/directory-path/file.csv';

This creates a table on top of the data already at that location. Nothing is copied, and the data is not moved from the source. To make it an external table, declare it with create external table; you can then drop the table and the source data is still available.

When you drop an external table, only the Hive table's metadata is dropped. The data still exists at the HDFS location.
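The drop behaviour can be checked with a short experiment. The sketch below assumes data files already exist under `/directory-path/`, uses a hypothetical table name `mytable_ext`, and relies on the `dfs` command of the Hive CLI.

```sql
-- External table over existing files; the comma delimiter is an assumption.
CREATE EXTERNAL TABLE mytable_ext (name STRING, number DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/directory-path/';

-- Removes only the table definition from the metastore.
DROP TABLE mytable_ext;

-- The data files are still listed at the HDFS location.
dfs -ls /directory-path/;
```

Dropping a managed (internal) table, by contrast, removes the table's data along with its metadata.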

Have a look at this related SE question regarding use cases for both internal and external tables:

Difference between Hive internal tables and external tables?

Ravindra babu
  • Aha, thank you! I have a better understanding now. So, do you specify a *directory* or a *file* for `location`? I see that you have a file path, but I've seen directories as well. – makansij Feb 20 '16 at 05:18
  • Secondly, you do not use the keyword `external` in your "Option 2". Is external table implied when you use the word "location"? – makansij Feb 20 '16 at 05:19
  • I have used only a file until now. I'm not sure about a directory. If a directory is allowed, all lines in the files should be in the same format. Location is implied for external tables only. – Ravindra babu Feb 20 '16 at 06:48