
I have created a Hive table using the query below, and I insert data into this table on a daily basis using the second query shown below.

create EXTERNAL table IF NOT EXISTS DB.efficacy
(
product string,
TP_Silent INT,
TP_Active INT,
server_date date
)

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs://hdfsadlproduction/user/DB/Report/efficacy';

Insert INTO DB.efficacy
select 
    product,
    SUM(CASE WHEN verdict = 'TP_Silent' THEN 1 ELSE 0 END ),
    SUM(CASE WHEN verdict = 'TP_Active' THEN 1 ELSE 0 END ) ,
    current_date()
from
    DB.efficacy_raw
group by 
    product
;

The issue is that every day, when my insert query executes, it creates a new file in HDFS. I want each day's query output to be appended to the same single file, but HDFS instead contains files named like 000000_0, 000000_0_copy_1, 000000_0_copy_2.

I have used the following Hive settings:

SET hive.execution.engine=mr;
SET tez.queue.name=${queueName};
SET mapreduce.job.queuename=${queueName};
SET mapreduce.map.memory.mb  = 8192;
SET mapreduce.reduce.memory.mb = 8192;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.parallel = true;
SET hive.exec.parallel.thread.number = 2;
SET mapreduce.input.fileinputformat.split.maxsize=2048000000;
SET mapreduce.input.fileinputformat.split.minsize=2048000000;
SET mapreduce.job.reduces = 20;
SET hadoop.security.credential.provider.path=jceks://hdfs/user/efficacy/s3-access/efficacy.jceks;
set hive.vectorized.execution.enabled=false;
set hive.enforce.bucketmapjoin=false;
set hive.optimize.bucketmapjoin.sortedmerge=false;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=false;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.intermediate=false;
set hive.exec.compress.output=false;
set hive.exec.reducers.max=1;

I am a beginner in the Hive and Hadoop world, so please excuse any mistakes. Any help will be greatly appreciated.

Note: I am using Hadoop 2.7.3.2.5.0.55-1

  • Try `Insert OVERWRITE TABLE DB.efficacy` – OneCricketeer Mar 19 '18 at 03:32
  • During INSERT (not overwrite) Hive will create additional files with the new rows. If you want the new rows to be written as a single file, use the `hive.merge.*` settings. See this answer: https://stackoverflow.com/a/45266244/2700344 I recommend using the ORC format. There is ALTER TABLE CONCATENATE for ORC, which will let you merge all files (old and newly inserted data). – leftjoin Mar 19 '18 at 07:20
  • Thanks for recommending the post. I tried the hive.merge.* settings mentioned in that answer, but I am still getting a second file named "000000_0_copy_1" when I run the INSERT INTO TABLE command. – Ashish Mittal Mar 19 '18 at 10:28
  • Thanks leftjoin, I changed my table to be stored as an ORC file, and I can merge all the small files using the CONCATENATE hive query as you mentioned. I will add an answer to this post with your solution. – Ashish Mittal Mar 21 '18 at 01:34
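
For reference, a minimal sketch of the `hive.merge.*` settings mentioned in the comments above; the threshold values are illustrative defaults, not values from my cluster. Note that these settings only consolidate the files produced by a single query run; they do not append new rows to files written by earlier runs.

-- merge small output files at the end of map-only and map-reduce jobs
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- run a merge pass when the average output file is below this size (bytes)
SET hive.merge.smallfiles.avgsize=16000000;
-- target size of each merged file (bytes)
SET hive.merge.size.per.task=256000000;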

1 Answer


I didn't find any direct mechanism or Hive setting that automatically merges all the small files at the end of the query. Concatenation of small files is currently not supported for tables stored as text files.

As per the comment by "leftjoin" on my post, I created the table in ORC format and then used the CONCATENATE hive query to merge all the small files into a single big file.
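
A minimal sketch of that approach, assuming an ORC table with the same schema (the table name below is illustrative, not necessarily the exact one I used):

-- "efficacy_orc" is an illustrative name for the ORC copy of the table
create table IF NOT EXISTS DB.efficacy_orc
(
product string,
TP_Silent INT,
TP_Active INT,
server_date date
)
STORED AS ORC;

-- after each daily insert into the ORC table, merge its small files in place
ALTER TABLE DB.efficacy_orc CONCATENATE;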

I then used the hive query below to export the data from this single big ORC file into a single text file, and was able to complete my task with the exported text file.

INSERT OVERWRITE DIRECTORY '<Hdfs-Directory-Path>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT * FROM default.foo;

Courtesy: https://community.hortonworks.com/questions/144122/convert-orc-table-data-into-csv.html
