I would like to create a Hive extension (macro / UDF / gateway / proxy / facade or something else) which can
-
a) create/modify DB tables and
b) process data.
The issue is that for b) parallel processing, the common practice for UDFs, is desired, while for a) it must be prevented, for example because I can't add the same column to a table multiple times. On its external interfaces the solution should remain compatible with the Hive connectors of SAS, SAP, R and Pentaho, i.e. it should still behave and be usable like Hive.
How would you recommend achieving the execution of DB create/modify statements without running into errors caused by duplicate HQL commands from parallel execution of the UDF?
My ideas:
-
1. Use a JDBC wrapper that takes care of a), as described in the links below, and UDFs for b); see the first sketch after this list. Problem: Requires extra programming plus client-/server-side installation and configuration: http://stackoverflow.com/questions/8293314/is-there-any-lightweight-jdbc-wrapper-with-these-features, https://www.redfin.com/blog/2011/03/boilerplate_jdbc_wrapper.html, http://www.javaworld.com/article/2074249/data-storage/create-your-own-type-3-jdbc-driver--part-1.html, http://jdbc.jcabi.com/
2. Hive with UDFs + hooking into the Hive parser; inside the hook the HQL would be executed just once (see the second sketch below). Problem: Quite complex and requires changing the Hive settings (maybe not acceptable): http://stackoverflow.com/questions/17461932/hive-execution-hook
3. Hive with macros (see the third sketch below). Problem: Macros do not seem fully mature, are quite restricted (mainly to numeric expressions?) and are poorly documented.
4. Hive with UDFs + an explicit combiner RESTful service + JDBC/Beeline/HCatalog as the Hive command API (see the fourth sketch below). Problem: The service is a single point of failure (SPoF).
5. Hive with UDFs + a Hive-combiner-based implementation. Problem: Impossible or very complex: http://stackoverflow.com/questions/10925840/partial-aggregation-vs-combiners-which-one-faster
6. Use an aggregate/combine approach for UDAFs, as described at the end of the article below. Problem: Not well documented, maybe not feasible: http://www.dataiku.com/blog/2013/05/01/a-complete-guide-to-writing-hive-udf.html
7. Derive from GenericUDAFEvaluator and implement merge() so that the required SQL statements end up exactly once in a separate table; some other mechanism, such as a Hive macro, could then execute the commands that accumulate in this table (see the last sketch below). Problem: Complex implementation, plus the problems of Hive macros.
8. Use/extend a Hive gateway implementation such as the Knox gateway-service-hive: http://repo1.maven.org/maven2/org/apache/knox/gateway-service-hive/0.8.0/ Problem: Too Knox-specific; a generic JDBC wrapper is a better basis in most cases.
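To make 1. concrete, here is a minimal sketch of the wrapper idea, assuming a class that delegates to the real Hive JDBC Statement (the class name, the CREATE/ALTER prefix check and the in-JVM dedup set are all mine, purely illustrative):

import java.sql.SQLException;
import java.sql.Statement;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: delegates to a real Hive JDBC Statement and makes
// structural (DDL) statements idempotent within this JVM.
public class DdlOnceStatement {

    private static final Set<String> executedDdl = ConcurrentHashMap.newKeySet();

    private final Statement delegate; // the real Hive statement

    public DdlOnceStatement(Statement delegate) {
        this.delegate = delegate;
    }

    public boolean execute(String hql) throws SQLException {
        String normalized = hql.trim().toUpperCase();
        boolean isDdl = normalized.startsWith("CREATE") || normalized.startsWith("ALTER");
        if (isDdl && !executedDdl.add(normalized)) {
            return false; // this DDL already ran here, skip the duplicate
        }
        return delegate.execute(hql); // everything else passes through
    }
}

A set held in one JVM of course only deduplicates locally; across tasks on different nodes it would have to live in a shared store, which is part of the extra programming effort mentioned above.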
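For 2., a pre-execution hook registered via hive.exec.pre.hooks would look roughly like this (a sketch; the dedup set and the decision to fail duplicate queries are my assumptions, since as far as I can tell the hook API offers no clean way to silently skip a query):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;

// Registered with: SET hive.exec.pre.hooks=com.example.DdlDedupHook;
public class DdlDedupHook implements ExecuteWithHookContext {

    private static final Set<String> seenDdl = ConcurrentHashMap.newKeySet();

    @Override
    public void run(HookContext hookContext) throws Exception {
        String query = hookContext.getQueryPlan().getQueryString().trim();
        boolean isDdl = query.toUpperCase().startsWith("CREATE")
                || query.toUpperCase().startsWith("ALTER");
        if (isDdl && !seenDdl.add(query)) {
            // Throwing fails the duplicate query rather than skipping it,
            // which may not be acceptable to the caller.
            throw new IllegalStateException("duplicate DDL suppressed: " + query);
        }
    }
}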
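Regarding 3., the following shows what a macro can do and thereby the restriction: a macro is a single expression over typed arguments (typically numeric), not a statement block, so it cannot carry the DDL logic needed for a). The connection URL and table name are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MacroDemo {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // A macro is one expression over typed arguments, nothing more:
            stmt.execute("CREATE TEMPORARY MACRO sigmoid(x DOUBLE) 1.0 / (1.0 + EXP(-x))");
            stmt.execute("SELECT sigmoid(score) FROM some_table");
        }
    }
}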
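For 4., the UDF side of the combiner service could be as simple as the following sketch; the endpoint and its contract (execute each distinct statement exactly once, answer 200 otherwise) are hypothetical:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Every parallel task posts its DDL to one central service, which is
// expected to execute each distinct statement exactly once via JDBC/Beeline.
public final class DdlServiceClient {

    public static void submitDdl(String hql) throws Exception {
        URL url = new URL("http://ddl-service.example.com:8080/ddl"); // hypothetical endpoint
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(hql.getBytes(StandardCharsets.UTF_8));
        }
        if (conn.getResponseCode() != 200) {
            throw new IllegalStateException("DDL service returned " + conn.getResponseCode());
        }
    }
}

That central service is exactly the SPoF named above.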
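For 6./7., a skeleton of the evaluator I have in mind; the GenericUDAFResolver and the ObjectInspector plumbing in init()/terminatePartial() are omitted for brevity, and the cast in merge() stands in for proper ObjectInspector handling:

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;

// Sketch: collect the DDL statements produced by all tasks into one
// deduplicated set; a single downstream step could then execute each
// statement exactly once.
public class CollectDdlEvaluator extends GenericUDAFEvaluator {

    // Aggregation buffer holding the distinct DDL statements seen so far.
    static class DdlBuffer extends AbstractAggregationBuffer {
        final Set<String> ddl = new HashSet<String>();
    }

    @Override
    public AggregationBuffer getNewAggregationBuffer() throws HiveException {
        return new DdlBuffer();
    }

    @Override
    public void reset(AggregationBuffer agg) throws HiveException {
        ((DdlBuffer) agg).ddl.clear();
    }

    @Override
    public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
        if (parameters.length > 0 && parameters[0] != null) {
            ((DdlBuffer) agg).ddl.add(parameters[0].toString());
        }
    }

    @Override
    public Object terminatePartial(AggregationBuffer agg) throws HiveException {
        return ((DdlBuffer) agg).ddl; // real code: serialize via a list ObjectInspector
    }

    @Override
    @SuppressWarnings("unchecked")
    public void merge(AggregationBuffer agg, Object partial) throws HiveException {
        if (partial != null) {
            ((DdlBuffer) agg).ddl.addAll((Set<String>) partial); // union removes duplicates
        }
    }

    @Override
    public Object terminate(AggregationBuffer agg) throws HiveException {
        return ((DdlBuffer) agg).ddl; // exactly one copy of each statement
    }
}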
Not quite sufficient:
-
9. Adding "IF NOT EXISTS" to the statement for the HQL statements where this is supported. There are also other mechanisms like http://stackoverflow.com/questions/14381895/mysql-add-column-if-not-exist, http://stackoverflow.com/questions/972922/add-column-to-mysql-table-if-it-does-not-exist
10. The https://github.com/jwills/exhibit UDFs on GitHub are interesting, but not quite sufficient.
11. getExistingHiveConnection() confuses the Hive parser, and getNewHiveConnection() also doesn't solve the problem that unwanted parallel execution of structural HQL commands leads to errors.
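To illustrate 9. (table and column names are placeholders): CREATE TABLE and ADD PARTITION have IF NOT EXISTS variants, but to my knowledge ADD COLUMNS does not, which is exactly where this approach falls short:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IdempotentDdl {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // Safe to run from parallel tasks:
            stmt.execute("CREATE TABLE IF NOT EXISTS results (id INT, score DOUBLE)"
                    + " PARTITIONED BY (dt STRING)");
            stmt.execute("ALTER TABLE results ADD IF NOT EXISTS PARTITION (dt='2016-01-01')");
            // No IF NOT EXISTS variant for adding a column, so this still
            // fails when executed a second time:
            stmt.execute("ALTER TABLE results ADD COLUMNS (new_col INT)");
        }
    }
}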
I noticed that opening another connection to Hive from inside a UDF confuses the Hive parser, which is understandable. However, inside a complex UDF I also noticed that the second .toString() call no longer works in this context. Why?
public Object evaluate(GenericUDF.DeferredObject[] paramArrayOfDeferredObject) { …
    for (int i = 0; i < paramLen; i++) {
        paramObj = paramArrayOfDeferredObject[i].get();
        paramObjectArr[i] = this.objConverters[0].convert(paramObj);
        tableName = paramObjectArr[i].toString(); // works
        …
    }

Later in the same UDF:

    String tableName = paramObjectArr[0].toString(); // does NOT work, an empty string is assigned; why?
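My current, unverified suspicion: the ObjectInspectorConverters converter reuses a single output object (e.g. one Text instance) across convert() calls, so every slot of paramObjectArr ends up referencing the same mutable object, whose content has been overwritten or reset by the time the later .toString() runs. If that is the cause, detaching the value immediately should help (paramStrings is my own name):

// Hypothetical fix: copy each value out of the converter's reusable
// output object right away instead of keeping the reference.
String[] paramStrings = new String[paramLen];
for (int i = 0; i < paramLen; i++) {
    Object converted = this.objConverters[0].convert(paramArrayOfDeferredObject[i].get());
    paramStrings[i] = (converted == null) ? null : converted.toString(); // immutable copy
}
// later: paramStrings[0] still holds the value captured in the loop
String tableName = paramStrings[0];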