
I have a dataset that has data added almost every day and needs to be processed every day as part of a larger ETL. When I select the partition directly, the query is really fast:

SELECT * FROM JSON.`s3://datalake/partitoned_table/?partition=2019-05-20`

The issue, however, is that the event type does not generate data on some Sundays, so no partition exists for those days. Because of this, I cannot rely on the previous statement to run my daily job.

In another attempt, I tried to have Spark find the latest partition of the dataset, to be sure the bigger query wouldn't fail:

SELECT * FROM JSON.`s3://datalake/partitoned_table/`
WHERE partition = (SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)

This works every time, but it is unbelievably slow.

I found numerous articles and references on how to build and maintain partitions, yet nothing about how to read them correctly.

Any idea how to have this done properly?

Breathe
  • Your question is already answered [here](https://stackoverflow.com/questions/56319532/how-to-extract-latest-recent-partition-from-the-list-of-year-month-day-partition?answertab=votes#tab-top). Basically, use [SHOW PARTITIONS](https://docs.databricks.com/spark/latest/spark-sql/language-manual/show-partitions.html#show-partitions) – Gsquare May 27 '19 at 23:34

2 Answers


(SELECT MAX(partition) FROM JSON.`s3://datalake/partitoned_table/`)
This query will be executed as a subquery in Spark.
Reasons for slowness:
1. The subquery has to be executed completely before the actual query execution can start.
2. The above query lists all the S3 files to retrieve the partition information. If the folder has a large number of files, this takes a long time: the time taken for listing is directly proportional to the number of files.

We could create a table on top of s3://datalake/partitoned_table/ with the partitioning scheme; let's say the name of the table is tbl.
You could then perform
ALTER TABLE tbl RECOVER PARTITIONS
which stores the partition information in the metastore. This also involves a listing, but it is a one-time operation, and Spark spawns multiple threads to perform the listing, making it faster.
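
A minimal sketch of both steps, assuming the data is JSON, the partition column is named `partition` as in the question, and a hypothetical data column event (replace with the dataset's real schema):

CREATE TABLE tbl (event STRING, `partition` STRING)
USING JSON
PARTITIONED BY (`partition`)
LOCATION 's3://datalake/partitoned_table/';

-- One-time listing; re-run when new partitions arrive
ALTER TABLE tbl RECOVER PARTITIONS;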

Then we could fire

SELECT * FROM tbl WHERE partition = (SELECT MAX(partition) FROM tbl)

This gets the partition information only from the metastore, without having to list the object store, which is an expensive operation.
The cost incurred in this approach is that of recovering the partitions.
After that, all queries will be faster. (When data for a new partition arrives, we need to recover partitions again.)
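
As the comment under the question also suggests, once the table exists in the metastore, the partition list itself can be read without touching S3 at all; a minimal sketch:

-- Served from the metastore; returns rows like partition=2019-05-20
SHOW PARTITIONS tbl;

Taking the maximum of those values gives the latest partition without any object-store listing.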

DaRkMaN

Workaround when you don't have a Hive metastore:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = spark.sparkContext.hadoopConfiguration // Hadoop config from the active SparkSession
FileSystem.get(URI.create("s3://datalake/partitoned_table"), conf).listStatus(new Path("s3://datalake/partitoned_table/"))

The above code will give you the list of partition paths, for example: List(s3://datalake/partitoned_table/partition=2019-05-20, s3://datalake/partitoned_table/partition=2019-05-21, ...)

This is very efficient because it only fetches metadata from the S3 location.

Just take the latest partition and use it in your SQL, for example with the sketch below.
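
A minimal sketch of the full workaround, assuming the partition=YYYY-MM-DD directory layout shown above and an active SparkSession named spark:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val base = "s3://datalake/partitoned_table/"
val fs = FileSystem.get(URI.create(base), spark.sparkContext.hadoopConfiguration)

// Extract the date part of each partition directory name; ISO dates
// sort lexicographically, so max picks the most recent partition.
val latest = fs.listStatus(new Path(base))
  .filter(_.isDirectory)
  .map(_.getPath.getName)           // e.g. "partition=2019-05-20"
  .filter(_.startsWith("partition="))
  .map(_.stripPrefix("partition="))
  .max                              // throws if no partition exists yet

val df = spark.sql(s"SELECT * FROM JSON.`${base}partition=$latest`")

Because the listing touches only directory metadata, this stays fast even when the partitions contain many files.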

Kishore