
Help me please to write an optimal spark query. I have read about predicate pushdown:

When you execute where or filter operators right after loading a dataset, Spark SQL will try to push the where/filter predicate down to the data source using a corresponding SQL query with WHERE clause (or whatever the proper language for the data source is).

Will predicate pushdown work after the .as(Encoders.kryo(MyObject.class)) operation?

spark
   .read()
   .parquet(params.getMyObjectsPath())

   // As I understand predicate pushdown will work here
   // But I should construct MyObject from org.apache.spark.sql.Row manually

   .as(Encoders.kryo(MyObject.class))

   // QUESTION: will predicate pushdown work here as well?

   .collectAsList();
VB_

1 Answer


It won't work. Once you use Encoders.kryo, each record is serialized into a single binary blob. That representation doesn't benefit from columnar storage and doesn't provide efficient access to individual fields (every access requires deserializing the whole object), let alone predicate pushdown or more advanced optimizations.
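To see why: after the kryo encoder, the only way to filter is with a typed lambda, which is opaque to Catalyst, so the predicate can never be translated into a Parquet filter. A sketch (the getStatus accessor on MyObject is an assumption for illustration):

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<MyObject> ds = spark
    .read()
    .parquet(params.getMyObjectsPath())
    .as(Encoders.kryo(MyObject.class));

// This lambda forces full deserialization of every record. Catalyst sees only
// an opaque function, so nothing is pushed down to the Parquet scan.
ds.filter((FilterFunction<MyObject>) o -> "ACTIVE".equals(o.getStatus()))
  .collectAsList();
```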

You would be better off with Encoders.bean, if the MyObject class allows for that. In general, to get the full advantage of Dataset optimizations you need at least a type which can be encoded using a more specific encoder.
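If MyObject can be expressed as a Java bean (no-arg constructor plus getters/setters), a sketch along these lines keeps the query within the optimizer's reach; the status column is an assumption for illustration, and you can confirm pushdown by looking for PushedFilters in the physical plan:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import static org.apache.spark.sql.functions.col;

// Filter on the untyped plan first, then apply the bean encoder: the predicate
// stays in the relational plan, so it can be pushed down to the Parquet reader.
Dataset<MyObject> ds = spark
    .read()
    .parquet(params.getMyObjectsPath())
    .filter(col("status").equalTo("ACTIVE"))
    .as(Encoders.bean(MyObject.class));

// explain() prints the physical plan; with pushdown working, the Parquet scan
// node should list the predicate under PushedFilters.
ds.explain();
```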

Related: Spark 2.0 Dataset vs DataFrame

zero323