When reading a large number of ORC files from HDFS under a single directory, Spark doesn't launch any tasks for some time, and I don't see any tasks running during that period. I'm using the command below to read the ORC files, along with the following spark.sql configs.
What is Spark doing under the hood when spark.read.orc is issued?
spark.read.schema(schema1).orc("hdfs://test1").filter("date >= 20181001")
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true
Also, instead of reading the ORC files directly, I tried running a Hive query on the same dataset, but I was not able to push down the filter predicate. Where should I set the configs below?
"hive.optimize.ppd":"true",
"hive.optimize.ppd.storage":"true"
What is the best way to read ORC files from HDFS, and which parameters should I tune?