Program Sketch

  • I create a HiveContext hiveContext.
  • With that context, I create a DataFrame df from a JDBC relational table.
  • I register the DataFrame df via df.registerTempTable("TESTTABLE").
  • I start a HiveThriftServer2 via HiveThriftServer2.startWithContext(hiveContext).
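
Put together, the setup looks roughly like this. This is a minimal sketch, not the exact code: the JDBC URL, driver and connection details are placeholders (PostgreSQL is assumed only because of the quoting style in the query log below), and the API is the Spark 1.x one used throughout the question.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("thrift-pushdown"))
val hiveContext = new HiveContext(sc)

// Load the relational table over JDBC; url and driver are hypothetical.
val df = hiveContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable" -> "test",
  "driver"  -> "org.postgresql.Driver"
)).load()

// Expose the DataFrame to external clients under the name TESTTABLE ...
df.registerTempTable("TESTTABLE")

// ... and start the Thrift JDBC/ODBC server (default port 10000)
// on top of the same HiveContext.
HiveThriftServer2.startWithContext(hiveContext)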

The TESTTABLE contains 1,000,000 entries; its columns are ID (INT) and NAME (VARCHAR):

+-----+--------+
| ID  |  NAME  |
+-----+--------+
| 1   | Hello  |
| 2   | Hello  |
| 3   | Hello  |
| ... | ...    |
+-----+--------+

With Beeline I access the SQL endpoint (at port 10000) of the HiveThriftServer2 and run a query, e.g.

SELECT * FROM TESTTABLE WHERE ID='3'

When I inspect the query log of the database for the SQL statements actually executed, I see

/*SQL #:1000000 t:657*/ SELECT "ID","NAME" FROM test;

So no predicate pushdown happens, as the WHERE clause is missing from the query sent to the database.

Questions

This gives rise to the following questions:

  • Why is no predicate pushdown performed?
  • Can this be changed by not using registerTempTable?
  • If so, how? Or is this a known restriction of the HiveThriftServer?

Counterexample

If I create a DataFrame df in Spark SQLContext and call

df.filter(df("ID") === 3).show()

I observe

/*SQL #:1*/ SELECT "ID","NAME" FROM test WHERE ID = 3;

as expected.
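
For completeness, the counterexample as a sketch, using the same SparkContext sc and the same hypothetical JDBC options as above, now through a plain SQLContext:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Same JDBC source as before; url and driver are placeholders.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable" -> "test",
  "driver"  -> "org.postgresql.Driver"
)).load()

// Here the filter is pushed down and arrives at the database as a WHERE clause.
df.filter(df("ID") === 3).show()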

Martin Senne
  • Did you try running explain on your DataFrame filter? df.filter(...).explain(true) – eliasah Nov 01 '15 at 13:40
  • Ohh, good hint. Will try that out! – Martin Senne Nov 01 '15 at 13:50
  • So what did the explain plan give you? – eliasah Nov 01 '15 at 14:23
  • @eliasah: Unfortunately, I see no way to place the `explain` call for the case Query -> ThriftServer -> HiveContext -> query execution, as I cannot programmatically control what the ThriftServer does or where to hook in. – Martin Senne Nov 01 '15 at 18:42
  • You place it on the DataFrame. E.g. consider a number of transformations that result in a DataFrame df. The command df.explain(true) will give you the query plan, the physical plan and the logical plan of your built query. – eliasah Nov 01 '15 at 18:46
  • @eliasah: For the latter case `df.filter(...)` I can call `df.filter(...).explain(true)`, but not for the first case, where I have no programmatic control. The internal control flow (I suppose) is: SQL query (via Beeline) -> ThriftServer -> .... -> query execution. The only thing I do is execute a query from Beeline. That said, I have no DataFrame on which to call `.explain(...)`. – Martin Senne Nov 01 '15 at 18:49
  • OK, I see. I'm not sure if there is a similar procedure in Hive, though. Sorry, I haven't dived into Thrift and Hive specifics yet, so I wouldn't be able to help you with that first case. – eliasah Nov 01 '15 at 18:53
  • Still, many thanks for your help. – Martin Senne Nov 01 '15 at 18:56
  • `EXPLAIN SELECT * FROM TESTTABLE WHERE ID='3'` should work, although I doubt it will be useful. I tried Spark 1.5.1 in local mode with default settings combined with PostgreSQL 9.4, and simple predicates are pushed down as expected. – zero323 Nov 01 '15 at 21:57
  • @zero: Great, will try that out. – Martin Senne Nov 02 '15 at 06:33

1 Answer


It's probably too late to answer. In this scenario, the pushdown did not work because ID is defined as INT, while the original query passes a string literal ('3'). Predicate pushdown requires not only a matching column name but also a matching type. Rewriting the query as SELECT * FROM TESTTABLE WHERE ID=3 (no quotes around the literal) should allow the filter to be pushed down.
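
If the DataFrame is still at hand, this is easy to check from the programmatic side. A sketch, assuming df is the JDBC-backed DataFrame from the question and the Spark 1.x API:

// Int-typed predicate: the literal matches the column type,
// so the filter can be pushed down to the JDBC source.
df.filter(df("ID") === 3).explain(true)

// String-typed predicate: the type mismatch forces a cast,
// and the filter may no longer be pushed down; compare the two plans.
df.filter(df("ID") === "3").explain(true)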