Program Sketch

  • I create a HiveContext hiveContext.
  • With that context, I create a DataFrame df from a JDBC relational table.
  • I register the DataFrame df via df.registerTempTable("TESTTABLE").
  • I start a HiveThriftServer2 via HiveThriftServer2.startWithContext(hiveContext).
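
Put together, the setup looks roughly like this. This is a minimal sketch, not the exact code: the JDBC URL, driver and connection details are placeholders (PostgreSQL is assumed only because of the quoting style in the query log below), and the API is the Spark 1.x one used throughout the question.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val sc = new SparkContext(new SparkConf().setAppName("thrift-pushdown"))
val hiveContext = new HiveContext(sc)

// Load the relational table over JDBC; url and driver are hypothetical.
val df = hiveContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable" -> "test",
  "driver"  -> "org.postgresql.Driver"
)).load()

// Expose the DataFrame to external clients under the name TESTTABLE ...
df.registerTempTable("TESTTABLE")

// ... and start the Thrift JDBC/ODBC server (default port 10000)
// on top of the same HiveContext.
HiveThriftServer2.startWithContext(hiveContext)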

The TESTTABLE contains 1,000,000 entries; its columns are ID (INT) and NAME (VARCHAR):

+-----+--------+
| ID  |  NAME  |
+-----+--------+
| 1   | Hello  |
| 2   | Hello  |
| 3   | Hello  |
| ... | ...    |
+-----+--------+

With Beeline I access the SQL endpoint (at port 10000) of the HiveThriftServer2 and run a query, e.g.

SELECT * FROM TESTTABLE WHERE ID='3'

When I inspect the query log of the database for the SQL statements actually executed, I see

/*SQL #:1000000 t:657*/ SELECT "ID","NAME" FROM test;

So no predicate pushdown happens, as the WHERE clause is missing from the query sent to the database.

Questions

This gives rise to the following questions:

  • Why is no predicate pushdown performed?
  • Can this be changed by not using registerTempTable?
  • If so, how? Or is this a known restriction of the HiveThriftServer?

Counterexample

If I create a DataFrame df in Spark SQLContext and call

df.filter(df("ID") === 3).show()

I observe

/*SQL #:1*/ SELECT "ID","NAME" FROM test WHERE ID = 3;

as expected.
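
For completeness, the counterexample as a sketch, using the same SparkContext sc and the same hypothetical JDBC options as above, now through a plain SQLContext:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Same JDBC source as before; url and driver are placeholders.
val df = sqlContext.read.format("jdbc").options(Map(
  "url"     -> "jdbc:postgresql://dbhost:5432/mydb",
  "dbtable" -> "test",
  "driver"  -> "org.postgresql.Driver"
)).load()

// Here the filter is pushed down and arrives at the database as a WHERE clause.
df.filter(df("ID") === 3).show()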

Martin Senne
  • Did you try running explain on your DataFrame filter? df.filter(...).explain(true) – eliasah Nov 01 '15 at 13:40
  • Ohh, good hint. Will try that out! – Martin Senne Nov 01 '15 at 13:50
  • So what did the explain plan give you? – eliasah Nov 01 '15 at 14:23
  • @eliasah: Unfortunately, I see no way to place the `explain` call for the case Query -> ThriftServer -> HiveContext -> query execution, as I cannot programmatically control what the ThriftServer does or where to hook in. – Martin Senne Nov 01 '15 at 18:42
  • You place it on the DataFrame. E.g. consider a number of transformations that result in a DataFrame df. The command df.explain(true) will give you the query plan, the physical plan and the logical plan of your built query. – eliasah Nov 01 '15 at 18:46
  • @eliasah: For the latter case `df.filter(...)` I can call `df.filter(...).explain(true)`, but not for the first case, where I have no programmatic control. The internal control flow (I suppose) is: SQL query (via Beeline) -> ThriftServer -> .... -> query execution. The only thing I do is execute a query from Beeline. That said, I have no DataFrame on which to call `.explain(...)`. – Martin Senne Nov 01 '15 at 18:49
  • OK, I see. I'm not sure if there is a similar procedure in Hive, though. Sorry, I haven't dived into Thrift and Hive specifics yet, so I wouldn't be able to help you with that first case. – eliasah Nov 01 '15 at 18:53
  • Still, many thanks for your help. – Martin Senne Nov 01 '15 at 18:56
  • `EXPLAIN SELECT * FROM TESTTABLE WHERE ID='3'` should work, although I doubt it will be useful. I tried Spark 1.5.1 in local mode with default settings combined with PostgreSQL 9.4, and simple predicates are pushed down as expected. – zero323 Nov 01 '15 at 21:57
  • @zero: Great, will try that out. – Martin Senne Nov 02 '15 at 06:33

1 Answer


It's probably too late to answer. In this scenario, the pushdown did not work because ID is defined as INT, while the original query passes a string literal ('3'). Predicate pushdown requires not only a matching column name but also a matching type. Rewriting the query as SELECT * FROM TESTTABLE WHERE ID=3 (no quotes around the literal) should allow the filter to be pushed down.
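
If the DataFrame is still at hand, this is easy to check from the programmatic side. A sketch, assuming df is the JDBC-backed DataFrame from the question and the Spark 1.x API:

// Int-typed predicate: the literal matches the column type,
// so the filter can be pushed down to the JDBC source.
df.filter(df("ID") === 3).explain(true)

// String-typed predicate: the type mismatch forces a cast,
// and the filter may no longer be pushed down; compare the two plans.
df.filter(df("ID") === "3").explain(true)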