I've seen a number of questions describing problems when working with S3 in Spark:
- Spark jobs finishes but application takes time to close
- spark-1.4.1 saveAsTextFile to S3 is very slow on emr-4.0.0
- Writing Spark checkpoints to S3 is too slow
many of them specifically describing issues with Parquet files:
- Slow or incomplete saveAsParquetFile from EMR Spark to S3
- Does Spark support Partition Pruning with Parquet Files
- is Parquet predicate pushdown works on S3 using Spark non EMR?
- Huge delays translating the DAG to tasks
- Fast Parquet row count in Spark
as well as some external sources referring to other issues with the Spark-S3-Parquet combination. This makes me think that either S3 with Spark, or this complete combination, may not be the best choice.
Am I onto something here? Can anyone provide an authoritative answer explaining:
- The current state of Parquet support, with a focus on S3.
- Can Spark (SQL) fully take advantage of Parquet features like partition pruning, predicate pushdown (including on deeply nested schemas), and Parquet metadata? Do all of these features work as expected on S3 (or compatible storage solutions)? (I've included a small sketch of what I mean below this list.)
- Ongoing developments and open JIRA tickets.
- Are there any configuration options one should be aware of when using these three together? (A sketch of the kind of settings I mean is also below.)
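
To make the pruning/pushdown point concrete, here is a minimal sketch of what I have in mind. The bucket name `my-bucket` and the path are placeholders, and I'm assuming the `s3a://` connector is on the classpath; the question is whether `PartitionFilters` and `PushedFilters` actually show up in the plan and behave efficiently when the data lives on S3:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-s3-pruning-sketch")
  .getOrCreate()

// Write a small dataset partitioned by a column, so directory-level
// partition pruning can apply on read.
spark.range(0, 1000)
  .selectExpr("id", "id % 10 AS bucket")
  .write
  .partitionBy("bucket")
  .parquet("s3a://my-bucket/tmp/parquet-sketch/")

// Read it back with both a partition filter (should prune directories)
// and a data filter (should be pushed down to the Parquet reader).
val df = spark.read.parquet("s3a://my-bucket/tmp/parquet-sketch/")
  .where("bucket = 3 AND id > 500")

// The physical plan shows PartitionFilters / PushedFilters, which is how
// I've been checking whether pruning and pushdown actually apply.
df.explain(true)
```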
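And for the configuration point, these are the kinds of settings I keep seeing mentioned in answers and blog posts; I'm not sure which of them are still relevant or advisable, so treat this as a sketch of the question rather than a recommendation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-s3-config-sketch")
  // Make sure Parquet filter pushdown and schema handling are in the expected state.
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.sql.parquet.mergeSchema", "false")
  // Use the v2 file output committer to reduce the slow rename-based commit on S3.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // Skip writing Parquet summary metadata files (_metadata / _common_metadata).
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  // Speculative execution can produce duplicate/partial output given S3's rename semantics.
  .config("spark.speculation", "false")
  .getOrCreate()
```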