
I have created a bash script (GitHub Link) that queries for all Hive databases, queries each table within them, parses each table's lastUpdateTime, and extracts the results to a CSV with columns "tablename,lastUpdateTime".
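For reference, the per-table loop can be sketched roughly like this (a minimal sketch, not the actual linked script; `transient_lastDdlTime` is the standard Hive table property holding the last-DDL epoch time, and output parsing is simplified):

```shell
#!/usr/bin/env bash
# Sketch of the slow approach: every `hive -e` spawns a fresh CLI,
# so each table pays the full JVM/CLI startup cost.

emit_csv_row() {                 # format one "tablename,lastUpdateTime" row
  printf '%s.%s,%s\n' "$1" "$2" "$3"
}

for db in $(hive -S -e "SHOW DATABASES;" 2>/dev/null); do
  for tbl in $(hive -S -e "USE $db; SHOW TABLES;" 2>/dev/null); do
    # transient_lastDdlTime: epoch seconds of the table's last DDL change
    ts=$(hive -S -e "SHOW TBLPROPERTIES $db.$tbl('transient_lastDdlTime');" 2>/dev/null)
    emit_csv_row "$db" "$tbl" "$ts"
  done
done > tables.csv
```

With three `hive` invocations per table-discovery level plus one per table, CLI startup dominates the runtime, which matches the ~10 tables/minute figure below.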

This is slow, however, because each iteration's call to `hive -e ...` starts a new Hive CLI process, which takes a noticeable amount of time to load.

Is there a way to speed up loading the Hive CLI, or to speed up the queries in some other way, that solves the same problem?

I have thought about loading the Hive CLI just once at the start of the script and calling bash commands from within it using the `! <command>` syntax, but I am not sure how to write loops inside the CLI. Alternatively, if I put the loops in a bash script file and execute that, I am not sure how to pass the results of queries executed within the Hive CLI back to that script as arguments.
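One way to avoid both problems is to invert the approach: instead of driving Hive from inside the CLI, generate a single HQL file from bash and run it with one `hive -f` invocation, paying the startup cost once. A minimal sketch, assuming a hypothetical `tables.txt` that lists fully-qualified `db.table` names one per line:

```shell
#!/usr/bin/env bash
# Collapse thousands of `hive -e` calls into one `hive -f` run.
# Assumes tables.txt lists "db.table" names, one per line (hypothetical input).

build_hql() {                    # turn the table list into one HQL script
  while read -r t; do
    printf "SHOW TBLPROPERTIES %s('transient_lastDdlTime');\n" "$t"
  done < "$1"
}

build_hql tables.txt > all_tables.hql 2>/dev/null
# Start the CLI once for the whole batch; -S suppresses log noise.
hive -S -f all_tables.hql > raw_times.txt 2>/dev/null
```

The output then needs one post-processing pass in bash to pair each timestamp back with its table name, but no query result ever has to be passed into a running CLI session.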

Without going into the specifics of the system I am running it on: the script processes about 10 tables per minute, which I think is really slow considering the databases we want to apply it to can contain thousands of tables.

pyAdmin
    You can run Hive commands in parallel, like in this example: https://stackoverflow.com/a/53755790/2700344 parametrize output filename to be like this ${DB}.csv and add ampersand to the end. Add waiting like in my answer – leftjoin Jun 14 '19 at 12:53
  • I already do parallel processing of hive queries by using the -P argument in xargs http://man7.org/linux/man-pages/man1/xargs.1.html which I see is what you suggest towards the end of your answer also. Any other ideas? – pyAdmin Jun 15 '19 at 14:55
  • Yeah, I see. Direct access to the metastore database then: https://sharebigdata.wordpress.com/2016/06/12/hive-metastore-internal-tables/ – leftjoin Jun 15 '19 at 17:14
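The metastore route from the last comment avoids the Hive CLI entirely: the `transient_lastDdlTime` property for every table is already sitting in the metastore database, so one SQL query can replace thousands of CLI launches. A sketch, assuming a MySQL-backed metastore with the standard schema tables (`DBS`, `TBLS`, `TABLE_PARAMS`); the database name `hive_metastore` and the credentials are placeholders:

```shell
#!/usr/bin/env bash
# Bypass the Hive CLI: read lastUpdateTime straight from the metastore.
# Assumes a MySQL-backed metastore; DB name and credentials are placeholders.

sql="
SELECT CONCAT(d.NAME, '.', t.TBL_NAME, ',', p.PARAM_VALUE)
FROM TBLS t
JOIN DBS d          ON t.DB_ID  = d.DB_ID
JOIN TABLE_PARAMS p ON t.TBL_ID = p.TBL_ID
WHERE p.PARAM_KEY = 'transient_lastDdlTime';"

# -N suppresses column headers, so the output is ready-made CSV
if command -v mysql >/dev/null; then
  mysql -u hive -p"$HIVE_METASTORE_PW" -N hive_metastore -e "$sql" > tables.csv
fi
```

The same query works against a PostgreSQL-backed metastore with minor syntax changes (e.g. `||` instead of `CONCAT`).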

0 Answers