1

I'm trying to load 2 CSV file from the HDFS folder,

Create Table Click_data ( A Timestamp, B string, C int, D BIGINT, E string, F string, G decimal(10,2), H BIGINT, I string) 

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 

STORED AS TEXTFILE Location ‘/user/data/’

tblproperties ("skip.header.line.count"="1");

Having 2 datasets in CSV format where we need to skip the headers how to load both files to the same table from the HDFS directory.

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • you can 1. concat files using `cat` command(remove header form second file) and then load them into the table. 2. Or load second file into a temp file and the load into original table using insert into – Koushik Roy Jan 10 '21 at 17:30
  • @KoushikRoy both steps you suggested are not necessary. – leftjoin Jan 11 '21 at 08:32
  • Table location looks wrong. It should be /user/data/click_data - or some other dedicated location for this table. – leftjoin Jan 11 '21 at 08:34

1 Answers1

1

You can have many files in the table location and header will be skipped in all files. Table location usually named as table: /user/data/click_data. Each table should have it's own location, /user/data/ seems not correct location, it looks like common folder where other tables locations are. Carefully check that location belongs to your table only.

Copy both files into table location '/user/data/' using hadoop fs -cp or hdfs dfs -cp or AWS CLI command aws s3 cp:

hadoop fs -cp 'source/path/to/files/*' '/user/data/Click_data'

Or you can use LOAD hive command for the same:

 hive> LOAD DATA INPATH 'source/path/to/files/file1.csv' INTO TABLE Click_data;
 hive> LOAD DATA INPATH 'source/path/to/files/file2.csv' INTO TABLE Click_data;

In both cases data will be selectable and header skipped in both files.

Alternatively you can create table on top of location where the files are (if there are no other files or folders which do not belong to your table). Just specify correct location. and use create EXTERNAL table ... . Also you can create table without location specified, check location using DESC FORMATTED command, then copy files into that location.

After loading files into table location better refresh table statistics, using ANALYZE command, this will allow CBO to work correctly. Read this: https://stackoverflow.com/a/41021682/2700344

leftjoin
  • 36,950
  • 8
  • 57
  • 116