
When reading a large number of ORC files from a directory on HDFS, Spark doesn't launch any tasks for quite some time, and I don't see any tasks running during that period. I'm using the command and spark.sql configs below to read the ORC data.

What is Spark doing under the hood when spark.read.orc is issued?

spark.read.schema(schame1).orc("hdfs://test1").filter("date >= 20181001")
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true

Also, instead of reading the ORC files directly, I tried running a Hive query on the same dataset, but the filter predicate was not pushed down. Where should I set the following configs: "hive.optimize.ppd": "true", "hive.optimize.ppd.storage": "true"?
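
To be concrete, this is roughly the kind of setup I mean (the table name is hypothetical, and I'm not sure whether a SparkSession config is even the right place for these settings -- hence the question):

    import org.apache.spark.sql.SparkSession

    object HiveOrcReadSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-orc-read-sketch")
          .enableHiveSupport()
          // Unsure whether these take effect here or belong in hive-site.xml instead.
          .config("hive.optimize.ppd", "true")
          .config("hive.optimize.ppd.storage", "true")
          .getOrCreate()

        // Hypothetical Hive table backed by the same ORC data as above.
        val df = spark.sql("SELECT * FROM test_db.test_table WHERE date >= 20181001")
        println(df.count())
      }
    }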

What is the best way to read ORC files from HDFS, and which tuning parameters should I set?

Giri
  • Storing _"large number of small files"_ on HDFS is looking for trouble, even with Spark -- cf. https://stackoverflow.com/questions/43895728/apache-spark-on-hdfs-read-10k-100k-of-small-files-at-once >> and that's even worse with ORC or Parquet, which are designed for LARGE files (i.e. 256 MB and above). – Samson Scharfrichter Oct 31 '18 at 19:36
  • You might want to consider storing the data in another storage than HDFS, e.g. spark-redis – Guy Korland Nov 06 '18 at 22:40

0 Answers