
I have created a bash script (GitHub Link) that queries for all Hive databases, queries each table within them, parses each table's lastUpdateTime, and extracts the results to a CSV with columns "tablename,lastUpdateTime".
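For reference, the per-table loop can be sketched roughly like this (a minimal sketch, not the actual linked script; `transient_lastDdlTime` is the standard Hive table property holding the last-DDL epoch time, and output parsing is simplified):

```shell
#!/usr/bin/env bash
# Sketch of the slow approach: every `hive -e` spawns a fresh CLI,
# so each table pays the full JVM/CLI startup cost.

emit_csv_row() {                 # format one "tablename,lastUpdateTime" row
  printf '%s.%s,%s\n' "$1" "$2" "$3"
}

for db in $(hive -S -e "SHOW DATABASES;" 2>/dev/null); do
  for tbl in $(hive -S -e "USE $db; SHOW TABLES;" 2>/dev/null); do
    # transient_lastDdlTime: epoch seconds of the table's last DDL change
    ts=$(hive -S -e "SHOW TBLPROPERTIES $db.$tbl('transient_lastDdlTime');" 2>/dev/null)
    emit_csv_row "$db" "$tbl" "$ts"
  done
done > tables.csv
```

With three `hive` invocations per table-discovery level plus one per table, CLI startup dominates the runtime, which matches the ~10 tables/minute figure below.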

This is slow, however, because each iteration's call to `hive -e ...` starts a new Hive CLI process, which takes a noticeable amount of time to load.

Is there a way to speed up loading the Hive CLI, or to speed up the queries in some other way, that solves the same problem?

I have thought about loading the Hive CLI just once at the start of the script and calling bash commands from within it using the `! <command>` syntax, but I am not sure how to write loops inside the CLI. Alternatively, if I put the loops in a bash script file and execute that, I am not sure how to pass the results of queries executed within the Hive CLI back to that script as arguments.
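One way to avoid both problems is to invert the approach: instead of driving Hive from inside the CLI, generate a single HQL file from bash and run it with one `hive -f` invocation, paying the startup cost once. A minimal sketch, assuming a hypothetical `tables.txt` that lists fully-qualified `db.table` names one per line:

```shell
#!/usr/bin/env bash
# Collapse thousands of `hive -e` calls into one `hive -f` run.
# Assumes tables.txt lists "db.table" names, one per line (hypothetical input).

build_hql() {                    # turn the table list into one HQL script
  while read -r t; do
    printf "SHOW TBLPROPERTIES %s('transient_lastDdlTime');\n" "$t"
  done < "$1"
}

build_hql tables.txt > all_tables.hql 2>/dev/null
# Start the CLI once for the whole batch; -S suppresses log noise.
hive -S -f all_tables.hql > raw_times.txt 2>/dev/null
```

The output then needs one post-processing pass in bash to pair each timestamp back with its table name, but no query result ever has to be passed into a running CLI session.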

Without going into the specifics of the system I am running it on: the script processes about 10 tables per minute, which I think is really slow considering the databases we want to apply it to can contain thousands of tables.

pyAdmin
    You can run Hive commands in parallel, like in this example: https://stackoverflow.com/a/53755790/2700344 parametrize output filename to be like this ${DB}.csv and add ampersand to the end. Add waiting like in my answer – leftjoin Jun 14 '19 at 12:53
  • I already do parallel processing of hive queries by using the -P argument in xargs http://man7.org/linux/man-pages/man1/xargs.1.html which I see is what you suggest towards the end of your answer also. Any other ideas? – pyAdmin Jun 15 '19 at 14:55
  • Yeah, I see. Direct access to the metastore database then: https://sharebigdata.wordpress.com/2016/06/12/hive-metastore-internal-tables/ – leftjoin Jun 15 '19 at 17:14
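The metastore route from the last comment avoids the Hive CLI entirely: the `transient_lastDdlTime` property for every table is already sitting in the metastore database, so one SQL query can replace thousands of CLI launches. A sketch, assuming a MySQL-backed metastore with the standard schema tables (`DBS`, `TBLS`, `TABLE_PARAMS`); the database name `hive_metastore` and the credentials are placeholders:

```shell
#!/usr/bin/env bash
# Bypass the Hive CLI: read lastUpdateTime straight from the metastore.
# Assumes a MySQL-backed metastore; DB name and credentials are placeholders.

sql="
SELECT CONCAT(d.NAME, '.', t.TBL_NAME, ',', p.PARAM_VALUE)
FROM TBLS t
JOIN DBS d          ON t.DB_ID  = d.DB_ID
JOIN TABLE_PARAMS p ON t.TBL_ID = p.TBL_ID
WHERE p.PARAM_KEY = 'transient_lastDdlTime';"

# -N suppresses column headers, so the output is ready-made CSV
if command -v mysql >/dev/null; then
  mysql -u hive -p"$HIVE_METASTORE_PW" -N hive_metastore -e "$sql" > tables.csv
fi
```

The same query works against a PostgreSQL-backed metastore with minor syntax changes (e.g. `||` instead of `CONCAT`).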

0 Answers