
I'm new to my role, and part of it requires creating and inserting data into both managed and external Hive tables. We have a few lines of SET parameters that we run at the beginning of a Hive session, but I've run into cases where the output files are merged for some partitions (a small number of files) but not for others (many smaller files), seemingly on random days.

My question is: when is it necessary to enter all of my Hive set parameters? Does it need to be done for every single insert/command/statement I'm running? Or just once at the beginning of the Hive session when I've launched Hive?

These are the standard set parameters we've been using:

SET mapred.job.queue.name=yometrics;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.merge.tezfiles=true;
phenderbender

1 Answer


You can put the configuration at the beginning of the script file; it then applies to the whole session.
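As a rough sketch of what that looks like, the script below sets the parameters once and then runs two inserts in the same session. The script name, table names, and column names are hypothetical and only serve the illustration; the SET keys are the ones from the question.

```sql
-- daily_load.hql (hypothetical name), run as one session: hive -f daily_load.hql

-- Session-level settings: entered once, they stay in effect until the session ends.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.merge.tezfiles=true;

-- SET with a key and no value prints the current value, which is a quick way
-- to confirm what the session is actually using.
SET hive.merge.tezfiles;

-- Both inserts run under the settings above (hypothetical tables and columns).
INSERT OVERWRITE TABLE metrics_daily PARTITION (dt)
SELECT user_id, event_count, dt
FROM metrics_staging;

INSERT INTO TABLE metrics_audit PARTITION (dt)
SELECT user_id, dt
FROM metrics_staging;
```

If the statements are instead submitted as separate Hive invocations, each invocation is its own session and needs its own SET lines.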

Alternatively, you can put common parameters in a separate file, params.hql, and call the following at the beginning of each script:

source /local/path/to/the/file/params.hql
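A minimal sketch of this layout, assuming params.hql holds the settings from the question. The second script, its table, and its columns are hypothetical.

```sql
-- ===== /local/path/to/the/file/params.hql =====
-- Shared settings, kept in one place so every script picks up the same values.
SET mapred.job.queue.name=yometrics;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=2000;
SET hive.merge.tezfiles=true;

-- ===== load_events.hql (hypothetical script) =====
-- Pull in the shared settings first; they apply to everything that follows.
SOURCE /local/path/to/the/file/params.hql;

INSERT OVERWRITE TABLE events_by_day PARTITION (dt)   -- hypothetical table
SELECT event_id, payload, dt
FROM events_staging;                                   -- hypothetical table
```

Keeping the settings in one file means a changed value only has to be edited once, and a script can still override a setting with its own SET line after the SOURCE call.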

You can also put them in hive-site.xml.

If you are on Qubole/AWS, you can also use a bootstrap script for the same purpose: https://docs.qubole.com/en/latest/user-guide/hive/bootstrap-script.html

leftjoin
  • Thank you. Can you clarify what a 'session' entails? Is running a single INSERT statement, for example, a single session? If I run the set parameters when I initialize Hive and then run, say, 10 INSERT statements separately, do I need to run the set statements every single time, or only once at the very beginning? – phenderbender Nov 12 '19 at 17:30
  • @phenderbender A session is one Hive invocation, either with a script file (-f option), which can include other files using the SOURCE command, or with inline statements (-e option). The whole script is a single session. For particular statements, you can override settings in the same script before the statement. – leftjoin Nov 12 '19 at 19:16
  • @phenderbender Have a look at this answer about merge: https://stackoverflow.com/a/48303807/2700344 and this one: https://stackoverflow.com/a/45266244/2700344 – leftjoin Nov 12 '19 at 19:17
  • Thank you @leftjoin. Do you know whether SET hive.merge.tezfiles=true; or SET hive.merge.mapfiles=true; is better for external tables? I just tried running with both of these, and it brought my file count down drastically. I'm trying again with just hive.merge.mapfiles=true, since adding that seemed to do the trick, but I can't find anything online about which to use for external vs. managed tables. – phenderbender Nov 12 '19 at 20:36
  • @phenderbender It does not matter whether the table is managed or external in this context. hive.merge.mapfiles works if the job is map-only, with no reducer. Force a reducer by adding `order by` and you will see. – leftjoin Nov 13 '19 at 04:56
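To tie the comments together, here is a hedged sketch of a script that enables merging for each kind of job and overrides one setting before a single statement. The hive.merge.* keys are standard Hive configuration properties; the size values are illustrative rather than recommendations, and the table and column names are hypothetical.

```sql
-- Enable merging regardless of how the job ends up being planned.
SET hive.merge.mapfiles=true;     -- merge small files at the end of a map-only job
SET hive.merge.mapredfiles=true;  -- merge small files at the end of a map-reduce job
SET hive.merge.tezfiles=true;     -- merge small files at the end of a Tez DAG

-- Thresholds in bytes (illustrative values, tune for your cluster).
SET hive.merge.smallfiles.avgsize=128000000;  -- merge when the average output file is smaller than this
SET hive.merge.size.per.task=256000000;       -- approximate target size of the merged files

-- Runs with the settings above (hypothetical tables and columns).
INSERT OVERWRITE TABLE metrics_daily PARTITION (dt)
SELECT user_id, event_count, dt
FROM metrics_staging;

-- Override for the next statement only, then restore the earlier value.
SET hive.merge.tezfiles=false;

INSERT OVERWRITE TABLE metrics_raw_copy PARTITION (dt)
SELECT user_id, event_count, dt
FROM metrics_staging;

SET hive.merge.tezfiles=true;
```

An override stays in effect for every later statement in the session until it is set again, which is why the sketch restores the value explicitly.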