I would like to create a Hive extension (macro / UDF / gateway / proxy / facade or something else) which can
-
a) create/modify DB tables and
b) process data.
The issue is that for b) parallel processing, the common practice for UDFs, is desired, while for a) it must be prevented, for example because I can't add the same column to a table multiple times. On its external interfaces the solution should remain compatible with the Hive connectors of SAS, SAP, R and Pentaho, i.e. it should still behave and be usable like Hive.
How would you recommend achieving the execution of DB create/modify statements without running into errors caused by duplicate HQL commands from parallel execution of the UDF?
My ideas:
-
1. Use a JDBC wrapper that takes care of a), as described in the links below, and UDFs for b); see the first sketch after this list. Problem: Requires extra programming plus client-/server-side installation and configuration: http://stackoverflow.com/questions/8293314/is-there-any-lightweight-jdbc-wrapper-with-these-features, https://www.redfin.com/blog/2011/03/boilerplate_jdbc_wrapper.html, http://www.javaworld.com/article/2074249/data-storage/create-your-own-type-3-jdbc-driver--part-1.html, http://jdbc.jcabi.com/
2. Hive with UDFs + hooking into the Hive parser; inside the hook the HQL would be executed just once (see the second sketch below). Problem: Quite complex and requires changing the Hive settings (maybe not acceptable): http://stackoverflow.com/questions/17461932/hive-execution-hook
3. Hive with macros (see the third sketch below). Problem: Macros do not seem fully mature, are quite restricted (mainly to numeric expressions?) and are poorly documented.
4. Hive with UDFs + an explicit combiner RESTful service + JDBC/Beeline/HCatalog as the Hive command API (see the fourth sketch below). Problem: The service is a single point of failure (SPoF).
5. Hive with UDFs + a Hive-combiner-based implementation. Problem: Impossible or very complex: http://stackoverflow.com/questions/10925840/partial-aggregation-vs-combiners-which-one-faster
6. Use an aggregate/combine approach for UDAFs, as described at the end of the article below. Problem: Not well documented, maybe not feasible: http://www.dataiku.com/blog/2013/05/01/a-complete-guide-to-writing-hive-udf.html
7. Derive from GenericUDAFEvaluator and implement merge() so that the required SQL statements end up exactly once in a separate table; some other mechanism, such as a Hive macro, could then execute the commands that accumulate in this table (see the last sketch below). Problem: Complex implementation, plus the problems of Hive macros.
8. Use/extend a Hive gateway implementation such as the Knox gateway-service-hive: http://repo1.maven.org/maven2/org/apache/knox/gateway-service-hive/0.8.0/ Problem: Too Knox-specific; a generic JDBC wrapper is a better basis in most cases.
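To make 1. concrete, here is a minimal sketch of the wrapper idea, assuming a class that delegates to the real Hive JDBC Statement (the class name, the CREATE/ALTER prefix check and the in-JVM dedup set are all mine, purely illustrative):

import java.sql.SQLException;
import java.sql.Statement;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: delegates to a real Hive JDBC Statement and makes
// structural (DDL) statements idempotent within this JVM.
public class DdlOnceStatement {

    private static final Set<String> executedDdl = ConcurrentHashMap.newKeySet();

    private final Statement delegate; // the real Hive statement

    public DdlOnceStatement(Statement delegate) {
        this.delegate = delegate;
    }

    public boolean execute(String hql) throws SQLException {
        String normalized = hql.trim().toUpperCase();
        boolean isDdl = normalized.startsWith("CREATE") || normalized.startsWith("ALTER");
        if (isDdl && !executedDdl.add(normalized)) {
            return false; // this DDL already ran here, skip the duplicate
        }
        return delegate.execute(hql); // everything else passes through
    }
}

A set held in one JVM of course only deduplicates locally; across tasks on different nodes it would have to live in a shared store, which is part of the extra programming effort mentioned above.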
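For 2., a pre-execution hook registered via hive.exec.pre.hooks would look roughly like this (a sketch; the dedup set and the decision to fail duplicate queries are my assumptions, since as far as I can tell the hook API offers no clean way to silently skip a query):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Registered with: SET hive.exec.pre.hooks=com.example.DdlDedupHook;
public class DdlDedupHook implements ExecuteWithHookContext {

    private static final Set<String> seenDdl = ConcurrentHashMap.newKeySet();

    @Override
    public void run(HookContext hookContext) throws Exception {
        String query = hookContext.getQueryPlan().getQueryString().trim();
        boolean isDdl = query.toUpperCase().startsWith("CREATE")
                || query.toUpperCase().startsWith("ALTER");
        if (isDdl && !seenDdl.add(query)) {
            // Throwing fails the duplicate query rather than skipping it,
            // which may not be acceptable to the caller.
            throw new IllegalStateException("duplicate DDL suppressed: " + query);
        }
    }
}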
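Regarding 3., the following shows what a macro can do and thereby the restriction: a macro is a single expression over typed arguments (typically numeric), not a statement block, so it cannot carry the DDL logic needed for a). The connection URL and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MacroDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // A macro is one expression over typed arguments, nothing more:
            stmt.execute("CREATE TEMPORARY MACRO sigmoid(x DOUBLE) 1.0 / (1.0 + EXP(-x))");
            stmt.execute("SELECT sigmoid(score) FROM some_table");
        }
    }
}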
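For 4., the UDF side of the combiner service could be as simple as the following sketch; the endpoint and its contract (execute each distinct statement exactly once, answer 200 otherwise) are hypothetical:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Every parallel task posts its DDL to one central service, which is
// expected to execute each distinct statement exactly once via JDBC/Beeline.
public final class DdlServiceClient {

    public static void submitDdl(String hql) throws Exception {
        URL url = new URL("http://ddl-service.example.com:8080/ddl"); // hypothetical endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(hql.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() != 200) {
            throw new IllegalStateException("DDL service returned " + conn.getResponseCode());
        }
    }
}

That central service is exactly the SPoF named above.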
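For 6./7., a skeleton of the evaluator I have in mind; the GenericUDAFResolver and the ObjectInspector plumbing in init()/terminatePartial() are omitted for brevity, and the cast in merge() stands in for proper ObjectInspector handling:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;

// Sketch: collect the DDL statements produced by all tasks into one
// deduplicated set; a single downstream step could then execute each
// statement exactly once.
public class CollectDdlEvaluator extends GenericUDAFEvaluator {

    // Aggregation buffer holding the distinct DDL statements seen so far.
    static class DdlBuffer extends AbstractAggregationBuffer {
        final Set<String> ddl = new HashSet<String>();
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
        return new DdlBuffer();
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
        ((DdlBuffer) agg).ddl.clear();
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
        if (parameters.length > 0 && parameters[0] != null) {
            ((DdlBuffer) agg).ddl.add(parameters[0].toString());
        }
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
        return ((DdlBuffer) agg).ddl; // real code: serialize via a list ObjectInspector
    }

    @Override
    @SuppressWarnings("unchecked")
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
        if (partial != null) {
            ((DdlBuffer) agg).ddl.addAll((Set<String>) partial); // union removes duplicates
        }
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
        return ((DdlBuffer) agg).ddl; // exactly one copy of each statement
    }
}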
Not quite sufficient:
-
9. Adding "IF NOT EXISTS" to the statement for the HQL statements where this is supported. There are also other mechanisms like http://stackoverflow.com/questions/14381895/mysql-add-column-if-not-exist, http://stackoverflow.com/questions/972922/add-column-to-mysql-table-if-it-does-not-exist
10. The https://github.com/jwills/exhibit UDFs on GitHub are interesting, but not quite sufficient.
11. getExistingHiveConnection() confuses the Hive parser, and getNewHiveConnection() also doesn't solve the problem that unwanted parallel execution of structural HQL commands leads to errors.
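To illustrate 9. (table and column names are placeholders): CREATE TABLE and ADD PARTITION have IF NOT EXISTS variants, but to my knowledge ADD COLUMNS does not, which is exactly where this approach falls short:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IdempotentDdl {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // Safe to run from parallel tasks:
            stmt.execute("CREATE TABLE IF NOT EXISTS results (id INT, score DOUBLE)"
                    + " PARTITIONED BY (dt STRING)");
            stmt.execute("ALTER TABLE results ADD IF NOT EXISTS PARTITION (dt='2016-01-01')");
            // No IF NOT EXISTS variant for adding a column, so this still
            // fails when executed a second time:
            stmt.execute("ALTER TABLE results ADD COLUMNS (new_col INT)");
        }
    }
}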
I noticed that opening another connection to Hive from inside a UDF confuses the Hive parser, which is understandable. However, inside a complex UDF I also noticed that the second .toString() call no longer works in this context. Why?
public Object evaluate(GenericUDF.DeferredObject[] paramArrayOfDeferredObject) { …
    for (int i = 0; i < paramLen; i++) {
        paramObj = paramArrayOfDeferredObject[i].get();
        paramObjectArr[i] = this.objConverters[0].convert(paramObj);
        tableName = paramObjectArr[i].toString(); // works
        …
    }

Later in the same UDF:

    String tableName = paramObjectArr[0].toString(); // does NOT work, an empty string is assigned; why?
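My current, unverified suspicion: the ObjectInspectorConverters converter reuses a single output object (e.g. one Text instance) across convert() calls, so every slot of paramObjectArr ends up referencing the same mutable object, whose content has been overwritten or reset by the time the later .toString() runs. If that is the cause, detaching the value immediately should help (paramStrings is my own name):

// Hypothetical fix: copy each value out of the converter's reusable
// output object right away instead of keeping the reference.
String[] paramStrings = new String[paramLen];
for (int i = 0; i < paramLen; i++) {
    Object converted = this.objConverters[0].convert(paramArrayOfDeferredObject[i].get());
    paramStrings[i] = (converted == null) ? null : converted.toString(); // immutable copy
}
// later: paramStrings[0] still holds the value captured in the loop
String tableName = paramStrings[0];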