
I am running many commands like this to export data from Hive as CSVs:

INSERT OVERWRITE DIRECTORY '/output/database/table/' 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
SELECT * FROM database.table;

That command does exactly what I want, but the output file sizes are essentially random, ranging from 100 MB to 500 MB. I would like to cap the size of each file it writes out at around 200 MB. Is this possible?

Jared DuPont
  • Try the merge settings: https://stackoverflow.com/a/48303807/2700344 and you can also try to control the number of files: https://stackoverflow.com/a/55375261/2700344 – leftjoin Feb 10 '22 at 08:11
  • Hmm, so I can't predict the number of files it should be because I am running this on way too many tables. Unless I am misunderstanding merging, I don't think it will help me in this situation because the files are larger than my target, not smaller. – Jared DuPont Feb 10 '22 at 15:06
  • See the second link in my first comment; bytes.per.reducer lets you limit the MAX file size. And if the job is running as map-only, trigger the reducer step by adding something like `distribute by ` (see the sketch after these comments) – leftjoin Feb 10 '22 at 15:49
  • And I'm not sure whether merge will work in combination with bytes.per.reducer; that is worth trying – leftjoin Feb 10 '22 at 15:56
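
Putting the comments together, a minimal sketch of what this might look like. The byte values and the distribute-by column are illustrative assumptions, not taken from the question, and the combination is untested:

-- Cap the amount of data each reducer receives; each reducer writes one file,
-- so this roughly caps the output file size (~200 MB here).
SET hive.exec.reducers.bytes.per.reducer=209715200;

-- If the query would otherwise run map-only, DISTRIBUTE BY forces a reduce
-- stage so the setting above actually takes effect.
INSERT OVERWRITE DIRECTORY '/output/database/table/'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
SELECT * FROM database.table
DISTRIBUTE BY some_column;   -- hypothetical column; any reasonably uniform key works

-- Optionally, the merge settings from the first linked answer could be tried
-- alongside the above to consolidate files that come out much smaller:
-- SET hive.merge.mapfiles=true;
-- SET hive.merge.mapredfiles=true;
-- SET hive.merge.smallfiles.avgsize=104857600;   -- 100 MB
-- SET hive.merge.size.per.task=209715200;        -- 200 MB

Skewed values in the distribute-by column would still produce oversized files, so a key with a fairly even distribution is preferable.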

0 Answers