
I am running many commands like this to export data from Hive as CSVs:

INSERT OVERWRITE DIRECTORY '/output/database/table/' 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
SELECT * FROM database.table;

That command does exactly what I want, but the output file sizes are essentially random, ranging from 100 MB to 500 MB. I would like to cap the size of each file it writes out at around 200 MB. Is this possible?

Jared DuPont
  • Try the merge settings: https://stackoverflow.com/a/48303807/2700344 and you can also try to control the number of files: https://stackoverflow.com/a/55375261/2700344 – leftjoin Feb 10 '22 at 08:11
  • Hmm, so I can't predict the number of files it should be because I am running this on way too many tables. Unless I am misunderstanding merging, I don't think it will help me in this situation because the files are larger than my target, not smaller. – Jared DuPont Feb 10 '22 at 15:06
  • See the second link in my first comment; bytes.per.reducer lets you limit the MAX file size. And if the job is running as map-only, trigger the reducer step by adding something like `distribute by ` (see the sketch after these comments) – leftjoin Feb 10 '22 at 15:49
  • And I'm not sure whether merge will work in combination with bytes.per.reducer; that is worth trying – leftjoin Feb 10 '22 at 15:56
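
Putting the comments together, a minimal sketch of what this might look like. The byte values and the distribute-by column are illustrative assumptions, not taken from the question, and the combination is untested:

-- Cap the amount of data each reducer receives; each reducer writes one file,
-- so this roughly caps the output file size (~200 MB here).
SET hive.exec.reducers.bytes.per.reducer=209715200;

-- If the query would otherwise run map-only, DISTRIBUTE BY forces a reduce
-- stage so the setting above actually takes effect.
INSERT OVERWRITE DIRECTORY '/output/database/table/'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
SELECT * FROM database.table
DISTRIBUTE BY some_column;   -- hypothetical column; any reasonably uniform key works

-- Optionally, the merge settings from the first linked answer could be tried
-- alongside the above to consolidate files that come out much smaller:
-- SET hive.merge.mapfiles=true;
-- SET hive.merge.mapredfiles=true;
-- SET hive.merge.smallfiles.avgsize=104857600;   -- 100 MB
-- SET hive.merge.size.per.task=209715200;        -- 200 MB

Skewed values in the distribute-by column would still produce oversized files, so a key with a fairly even distribution is preferable.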

0 Answers