Hive can store table data as CSV in HDFS using OpenCSVSerde. You can create an external table over the HDFS folder where you want the CSV files to appear:
CREATE EXTERNAL TABLE `csv_export`(
  wf_id string,
  file_name string,
  row_count int
)
COMMENT 'output table'
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'escapeChar'='\\',
  'quoteChar'='\"',
  'separatorChar'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  '/user/admin/export/folder';
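If you want to confirm the table points at the intended folder, Hive's DESCRIBE FORMATTED prints the location and serde settings:
DESCRIBE FORMATTED csv_export;  -- shows Location, SerDe Library, and serde parameters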
Then you can insert data into it using any SELECT you want:
INSERT INTO csv_export SELECT a, b, c FROM source_table;
The INSERT INTO ... SELECT statement could be a sophisticated join that looks for data quality issues. The resulting output would be the bad records.
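For example, here is a minimal sketch, assuming hypothetical source tables workflow_runs and file_manifest that should agree on row counts per wf_id (the table and column names are illustrative, not from the original):
-- workflow_runs and file_manifest are hypothetical source tables;
-- any run whose manifest row count disagrees is exported as a bad record.
INSERT INTO csv_export
SELECT r.wf_id, m.file_name, m.row_count
FROM workflow_runs r
JOIN file_manifest m ON r.wf_id = m.wf_id
WHERE m.row_count <> r.expected_row_count;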
The OpenCSVSerde will create one or more files for the rows inserted within the same statement:
hdfs dfs -ls /user/admin/export/folder
Found 2 items
-rwx------ 3 admin admin 885 2020-04-29 05:24 /user/admin/export/folder/000000_0
You can download a single file to disk and rename it:
hdfs dfs -copyToLocal /user/admin/export/folder/000000_0
mv 000000_0 output.csv
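Since copyToLocal also accepts a local destination path, the download and rename can be done in one step:
hdfs dfs -copyToLocal /user/admin/export/folder/000000_0 output.csv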
If there is a lot of data you will get more than one file, so it is safer to cat all the files in the folder into the local file:
hdfs dfs -cat '/user/admin/export/folder/*' > output.csv
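Alternatively, HDFS has a getmerge command that concatenates every file under a path into a single local file:
hdfs dfs -getmerge /user/admin/export/folder output.csv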
The resulting output.csv will be properly formatted CSV data.