
Will data extracts run more quickly if a DataFrame is sorted before being persisted as Parquet files?

Suppose we have the following peopleDf DataFrame (pretend this is a sample and the real one has 20 billion rows):

+-----+----------------+
| age | favorite_color |
+-----+----------------+
|  54 | blue           |
|  10 | black          |
|  13 | blue           |
|  19 | red            |
|  89 | blue           |
+-----+----------------+

Let's write out sorted and unsorted versions of this DataFrame to Parquet files.

peopleDf.write.parquet("s3a://some-bucket/unsorted/")
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")

Are there any performance gains when reading in the sorted data and doing a data extract based on favorite_color?

val pBlue1 = spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue")

// is this faster?

val pBlue2 = spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue")
Powers

1 Answer


Sorting provides a number of benefits:

  • more efficient filtering using file metadata: Parquet stores min/max statistics per row group, and sorting clusters similar values together, so a filter can skip entire row groups whose value ranges don't match the predicate.
  • better compression: long runs of identical or similar values compress well with Parquet's run-length and dictionary encodings.
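To see whether the scan actually benefits, you can inspect the physical plan. A minimal sketch, assuming peopleDf is in scope and a local SparkSession (the s3a paths above would work the same way):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write a sorted copy; each output file then covers a narrow range of
// favorite_color values, so its min/max footer statistics are selective.
peopleDf.sort($"favorite_color").write.parquet("/tmp/people_sorted/")

// The physical plan shows the predicate being pushed to the Parquet scan:
// look for "PushedFilters: [IsNotNull(favorite_color),
// EqualTo(favorite_color,blue)]" in the output.
spark.read
  .parquet("/tmp/people_sorted/")
  .filter($"favorite_color" === "blue")
  .explain()
```

The pushed filter appears for the unsorted data too; the difference is how many row groups the statistics let Parquet skip at read time.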

If you want to filter on a single column, partitioning on that column can be even more efficient and doesn't require a shuffle, although there are some related open issues right now.
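A minimal sketch of that partitioning alternative, assuming the same peopleDf and a hypothetical output path: partitionBy writes one subdirectory per favorite_color value, so a filter on that column prunes whole directories before any file is opened.

```scala
// Partition on the filter column at write time; Spark creates one
// subdirectory per value, e.g. .../partitioned/favorite_color=blue/.
peopleDf.write
  .partitionBy("favorite_color")
  .parquet("s3a://some-bucket/partitioned/")

// A filter on the partition column only lists and scans the matching
// directory (partition pruning) — no shuffle and no row-group scanning
// of the other colors.
val pBlue3 = spark.read
  .parquet("s3a://some-bucket/partitioned/")
  .filter($"favorite_color" === "blue")
```

The trade-off is cardinality: partitioning works well for a column with a modest number of distinct values, while a high-cardinality column would produce a directory explosion, which is where sorting is the better fit.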
