I am trying to build a beeline script to load a CSV file stored in HDFS into an external table in Hive. First, I create an empty external table:
CREATE EXTERNAL TABLE IF NOT EXISTS db.table
(column string)
STORED AS AVRO
LOCATION '/foo/bar';
After that, I checked whether the table was created, and it was. I had already loaded a CSV file into HDFS with:
hdfs dfs -put test.csv /temp/raw_csv
Now I want to load/insert this CSV data into the external Hive table, stored in Avro format (as defined in the CREATE statement). I tried it with this statement:
LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table;
This statement runs without an error; see the beeline output here:
INFO : Compiling command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8): LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
INFO : Completed compiling command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8); Time taken: 0.427 seconds
INFO : Executing command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8): LOAD DATA INPATH '/temp/raw_csv/test.csv' OVERWRITE INTO TABLE db.table
INFO : Starting task [Stage-0:MOVE] in serial mode
INFO : Loading data to table db.table from hdfs://temp/raw_csv/test.csv
INFO : Starting task [Stage-1:STATS] in serial mode
INFO : Table db.table stats: [numFiles=1, totalSize=45981179]
INFO : Completed executing command(queryId=hive_20201201130808_4120504b-6799-4833-83e7-5fa8ff8c6ca8); Time taken: 0.376 seconds
INFO : OK
No rows affected (0.87 seconds)
But when I try to select from the table with the following statement, I get an error:
select * FROM db.test;
INFO : Compiling command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa): select * FROM db.test
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:column, type:string, comment:null)
INFO : Completed compiling command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa); Time taken: 0.243 seconds
INFO : Executing command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa): select * FROM db.test
INFO : Completed executing command(queryId=hive_20201201131414_79187a87-c5e6-4b54-aecc-54c15d9a4bfa); Time taken: 0.006 seconds
INFO : OK
Error: java.io.IOException: java.io.IOException: Not a data file. (state=,code=0)
I suspect LOAD DATA only moves the CSV file into the table's location without converting it, so the Avro reader then fails on the raw file, but I am not sure. Is it possible to load data from a CSV file into a Hive Avro table with a beeline command? And/or what would be a better way to load this data? I normally load 1-100 million rows of data.
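For reference, the workaround I am considering (untested; the staging table name is a placeholder, and I assume a plain comma-separated file with a single column, matching the table above) is to define a text-based external table over the raw CSV directory and then convert it into the Avro table with an INSERT ... SELECT:

-- hypothetical staging table over the raw CSV directory
CREATE EXTERNAL TABLE IF NOT EXISTS db.table_csv_staging
(column string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/temp/raw_csv';

-- rewrite the data through Hive so it lands in the Avro table
INSERT OVERWRITE TABLE db.table
SELECT * FROM db.table_csv_staging;

Would that be a reasonable approach at this data volume, or is there a more direct option?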