
I am using a Hive external table to dump data as JSON. The dump files themselves look fine, but the files written by Hive vary widely in size, from around 400 MB to 7 GB. I want the files capped at a fixed maximum size (say 1 GB), but I am unable to achieve this. Please help! My query:

 INSERT OVERWRITE DIRECTORY '/myhdfs/location' 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe' 
    select * from MY_EXTERNAL_TABLE; 
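
For context, one way to confirm the uneven sizes, assuming access to the Hive CLI, is to list the dump directory via the CLI's standard dfs passthrough to the HDFS shell (-h prints human-readable sizes):

 dfs -ls -h /myhdfs/location;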

Hive Version: Hive 1.1.0-cdh5.14.2

Hadoop Version: Hadoop 2.6.0-cdh5.14.2


1 Answer


Set the bytes-per-reducer limit and add a distribute by clause (this triggers a reducer step), using an evenly distributed column or column list:

 set hive.exec.reducers.bytes.per.reducer=1000000000; -- ~1 GB of input per reducer

INSERT OVERWRITE DIRECTORY '/myhdfs/location' 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe' 
    select * from MY_EXTERNAL_TABLE distribute by <column or col list here>; 
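
For instance, assuming a hypothetical, evenly distributed column named event_date (not from the original question; substitute a real column from your table), the full statement would look like:

 set hive.exec.reducers.bytes.per.reducer=1000000000; 

 INSERT OVERWRITE DIRECTORY '/myhdfs/location' 
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.DelimitedJSONSerDe' 
    select * from MY_EXTERNAL_TABLE distribute by event_date; 

With this setting each reducer receives roughly 1 GB of input, and each reducer writes one output file, so the output files should land near the 1 GB target (the exact size depends on how the SerDe serializes the rows).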

Check also this answer: https://stackoverflow.com/a/55375261/2700344

  • Although this seems like a legit answer, somehow it does not work: my reducers get stuck at 67%. Without the distribute by, my job runs fine and finishes using mappers only, with zero reducers – Koustav Ray Dec 21 '20 at 07:54
  • @KoustavRay It may be because of skew in the distribute by column. Check the counts with a group by (see the sketch after this thread); if it is skewed, try to find an evenly distributed key – leftjoin Dec 21 '20 at 08:14
  • I distributed it by the primary key, which is a UUID, so every record has just one entry. – Koustav Ray Dec 21 '20 at 10:48
  • @KoustavRay Try some column that has low cardinality but is evenly distributed, so as not to create as many groups as a PK does – leftjoin Dec 21 '20 at 10:51
  • @KoustavRay A partition candidate (if you do not have a partition yet) is a good column to include in the distribute by – leftjoin Dec 21 '20 at 10:57
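
A minimal sketch of the skew check suggested in the comments, assuming a hypothetical candidate column dist_col (replace it with whatever you intend to put in the distribute by list):

 -- count rows per value of the candidate column;
 -- a handful of very large groups means distribute by on it will be skewed
 select dist_col, count(*) as cnt 
 from MY_EXTERNAL_TABLE 
 group by dist_col 
 order by cnt desc 
 limit 100;

If the counts are roughly even, the reducers should receive comparable amounts of data, and the stuck-at-67% behaviour is likely caused by something else.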