
I am using a Hive external table to dump data as JSON. The dump files themselves look fine, but the files written by Hive vary widely in size, from around 400 MB to 7 GB. I want the files capped at a fixed maximum size (say 1 GB), but I am unable to achieve this. Please help! My query:

 INSERT OVERWRITE DIRECTORY '/myhdfs/location' 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe' 
    select * from MY_EXTERNAL_TABLE; 
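
For context, one way to confirm the uneven sizes, assuming access to the Hive CLI, is to list the dump directory via the CLI's standard dfs passthrough to the HDFS shell (-h prints human-readable sizes):

 dfs -ls -h /myhdfs/location;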

Hive Version: Hive 1.1.0-cdh5.14.2

Hadoop Version: Hadoop 2.6.0-cdh5.14.2


1 Answer


Set the bytes-per-reducer limit and add a distribute by clause (this triggers a reducer step), using an evenly distributed column or column list:

 set hive.exec.reducers.bytes.per.reducer=1000000000; -- ~1 GB of input per reducer

INSERT OVERWRITE DIRECTORY '/myhdfs/location' 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe' 
    select * from MY_EXTERNAL_TABLE distribute by <column or col list here>; 
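
For instance, assuming a hypothetical, evenly distributed column named event_date (not from the original question; substitute a real column from your table), the full statement would look like:

 set hive.exec.reducers.bytes.per.reducer=1000000000; 

 INSERT OVERWRITE DIRECTORY '/myhdfs/location' 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe' 
    select * from MY_EXTERNAL_TABLE distribute by event_date; 

With this setting each reducer receives roughly 1 GB of input, and each reducer writes one output file, so the output files should land near the 1 GB target (the exact size depends on how the SerDe serializes the rows).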

Check also this answer: https://stackoverflow.com/a/55375261/2700344

  • Although this seems like a legit answer, somehow it does not work: my reducers get stuck at 67%. Without the distribute by, my job runs fine and finishes using mappers only, with zero reducers – Koustav Ray Dec 21 '20 at 07:54
  • @KoustavRay It may be because of skew in the distribute by column. Check the counts with a group by (see the sketch after this thread); if it is skewed, try to find an evenly distributed key – leftjoin Dec 21 '20 at 08:14
  • I distributed it by the primary key, which is a UUID, so every record has just one entry. – Koustav Ray Dec 21 '20 at 10:48
  • @KoustavRay Try some column that has low cardinality but is evenly distributed, so as not to create as many groups as a PK does – leftjoin Dec 21 '20 at 10:51
  • @KoustavRay A partition candidate (if you do not have a partition yet) is a good column to include in the distribute by – leftjoin Dec 21 '20 at 10:57
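
A minimal sketch of the skew check suggested in the comments, assuming a hypothetical candidate column dist_col (replace it with whatever you intend to put in the distribute by list):

 -- count rows per value of the candidate column;
 -- a handful of very large groups means distribute by on it will be skewed
 select dist_col, count(*) as cnt 
 from MY_EXTERNAL_TABLE 
 group by dist_col 
 order by cnt desc 
 limit 100;

If the counts are roughly even, the reducers should receive comparable amounts of data, and the stuck-at-67% behaviour is likely caused by something else.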