
I'm using Hive to process my CSV files. I've stored the CSV files in HDFS and want to create tables from those files.

I use the following command:

create external table if not exists csv_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA INPATH '/CsvData/csv_table.csv' OVERWRITE INTO TABLE csv_table;

So the file under /CsvData will be moved into /user/hive. It makes sense.

But what if I want to create another table?

create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive'
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;

It will raise an exception complaining that the directory is not empty.

ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask. Directory hdfs://localhost:9000/user/hive could not be cleaned up.

So it is hard for me to understand: does it mean I can store only one file under one directory? To store multiple files, do I have to create one directory for every file?

Is it possible to store all the files together?


1 Answer


The CREATE TABLE statement will NOT raise an exception complaining that the directory is not empty, because creating a table on top of an existing directory is a perfectly normal scenario.

You can store as many files in the directory as necessary, and all of them will be accessible to the table built on top of that folder.

The table location is a directory, not a file. If you need to create a new table and keep its files from being mixed with another table's, create a separate folder for it.
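For example (the subfolder paths here are just illustrative), give each table its own location under the warehouse directory, and the two LOAD commands from the question will no longer collide:

create external table if not exists csv_table (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/csv_table'
TBLPROPERTIES ("skip.header.line.count"="1");

create external table if not exists csv_table2 (dummy STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'hdfs://localhost:9000/user/hive/csv_table2'
TBLPROPERTIES ("skip.header.line.count"="1");

LOAD DATA INPATH '/CsvData/csv_table.csv' OVERWRITE INTO TABLE csv_table;
LOAD DATA INPATH '/CsvData/csv_table2.csv' OVERWRITE INTO TABLE csv_table2;

Each LOAD DATA ... OVERWRITE now clears and refills only its own table's directory.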

See also this answer for a clearer understanding: https://stackoverflow.com/a/54038932/2700344

  • Um, yes, it doesn't throw on creating but on loading. My bad, fixed. Please see my update. – Sraw Jul 02 '19 at 06:31
  • BTW, it might be because I'm creating external tables instead of managed ones. As you can see, the failed task is a move, which means Hive is trying to move the file to the destination. But I have to create external ones, as I don't want to tie the data to the schema. – Sraw Jul 02 '19 at 06:33
  • CREATE TABLE does not move files; LOAD DATA does. And you are using OVERWRITE. If the OVERWRITE keyword is used, then the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise, the files referred to by filepath will be added to the table. – leftjoin Jul 02 '19 at 06:45
  • Thanks for the clarification. I understand what you've said. But it still doesn't solve the problem that loading data into a non-empty directory causes errors. – Sraw Jul 02 '19 at 06:52
  • @Sraw Why do you want to create two tables on top of the same folder? Such tables will share the same data, regardless of whether they are external or managed. – leftjoin Jul 02 '19 at 06:52
  • @Sraw Do not use the hdfs://localhost:9000/user/hive folder. List its contents and you will see: other tables' subfolders are stored there. Use your own table subfolder instead, like this: hdfs://localhost:9000/user/hive/csv_table2 – leftjoin Jul 02 '19 at 06:55
  • Well, actually I just want to store the files together instead of in a lot of subdirectories. Or do you mean one directory can only "contain" one table? (Given that the schema is stored in the metadata database, the directory doesn't really contain the table, only the data.) – Sraw Jul 02 '19 at 06:55
  • @Sraw Table = directory. The LOAD command moves files into the location specified in the table DDL. Alternatively, you can simply copy the files into the table directory using the cp command. – leftjoin Jul 02 '19 at 06:57
  • @Sraw You should keep different tables in their own directories, not in a single one, because otherwise they will share the same data. And do not use the root warehouse directory if you do not want to destroy all your other tables. – leftjoin Jul 02 '19 at 06:59
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195842/discussion-between-leftjoin-and-sraw). – leftjoin Jul 02 '19 at 07:00
  • Thanks for all your responses. Those are really helpful. – Sraw Jul 02 '19 at 07:01
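The cp alternative mentioned in the comments can be done with the HDFS shell. A hypothetical example, assuming the paths from the question and the per-table subfolder layout suggested above:

# copy a file that is already in HDFS into the table's directory
hdfs dfs -cp /CsvData/csv_table.csv /user/hive/csv_table/
# or upload a file from the local filesystem
hdfs dfs -put ./csv_table2.csv /user/hive/csv_table2/

Unlike LOAD DATA, this leaves the source file in place, and the copied file is immediately visible to the table, since Hive reads every file under the table's location.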