After a recent upgrade to HDP 3.1 (now using Spark 2.3.x instead of 2.2.x), a query like:
spark.sql("SELECT * from mydb.mytable").filter('partition_date between "202001010000" and "202001020000").write.parquet("foo.out")
sometimes fails when reading from an HDFS-backed Hive table (no object storage involved). Note that the underlying data (an EXTERNAL table in Hive) has a retention period: any partition older than the retention cut-off is deleted. The deletion job runs every 5 minutes, so it can happen during the execution of the query above.
Partition filtering (predicate pushdown) seems to be enabled; the physical plan shows:

PartitionFilters: [isnotnull(partition_date#3099), (partition_date#3099 >= 202001010000), (partition_date#3099 <= 202001020000)]

Even so, more than the desired partitions are read during the initial path traversal. Since the upgrade to 2.3, Spark shows the progress of the directory listing in the UI, and interestingly we always get two entries: one for the oldest available directory, and one for the lower of the two boundaries of interest:
Listing leaf files and directories for 380 paths:
/path/to/files/on/hdfs/mydb.db/mytable/partition_date=202001010000/sub_part=0, ...
Listing leaf files and directories for 7100 paths:
/path/to/files/on/hdfs/mydb.db/mytable/partition_date=201912301300/sub_part=0, ...
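For reference, the PartitionFilters entry can also be inspected via explain(true), and inputFiles should trigger a listing without running the full write (though I am not certain it honours the partition filters). A sketch, with the same table and column names as in the query:

import spark.implicits._

val df = spark.sql("SELECT * FROM mydb.mytable")
  .filter('partition_date.between("202001010000", "202001020000"))

df.explain(true)  // prints the physical plan, including the PartitionFilters entry
df.inputFiles     // forces a file listing and returns the backing files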
Notice:
- the logged numbers of paths (380 and 7100) do not seem to reflect what a manual check suggests (one way of doing such a check is sketched right after this list)
- the job (sometimes) fails during the recursive listing of leaf files
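One way to do such a manual check is to count the partitions the metastore knows about inside the interval (a sketch; SHOW PARTITIONS returns strings like partition_date=202001010000/sub_part=0, and these timestamp strings compare lexicographically):

import spark.implicits._

// Count the metastore partitions whose partition_date lies in the interval.
val inRange = spark.sql("SHOW PARTITIONS mydb.mytable")
  .as[String]
  .map(_.split("/")(0).stripPrefix("partition_date="))
  .filter(d => d >= "202001010000" && d <= "202001020000")

println(inRange.count())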
The error message when the job fails is:
File does not exist: /path/to/files/on/hdfs/mydb.db/mytable/partition_date=201912301300/sub_part=0/file_name_unique_hash_timestamp.par
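A stopgap might be to make the job tolerate files that disappear between listing and reading, but I do not know whether spark.sql.files.ignoreMissingFiles is available in this 2.3 build, nor whether the listing phase itself respects it (the failure above happens during the listing), so this is only an assumption to test:

// Untested assumption: ignore files that vanish after being listed.
// Whether the recursive leaf-file listing honours this on Spark 2.3 is unclear to me.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")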
How can I force Spark to list only the directories inside the desired interval, rather than anything outside it, so that the listing cannot collide with the retention-driven deletion?
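The only workaround I can think of is to bypass the table-level listing entirely and hand the reader the exact partition directories. A sketch (the partition values are hard-coded purely for illustration; in reality they would be generated from the interval, and this bypasses whatever the metastore knows about the table):

// Possible workaround (sketch): enumerate the wanted partition directories
// ourselves so that nothing outside the interval is ever listed.
val base   = "/path/to/files/on/hdfs/mydb.db/mytable"
val wanted = Seq("202001010000", "202001020000")  // illustrative values only
val paths  = wanted.map(v => s"$base/partition_date=$v")

spark.read
  .option("basePath", base)  // keep partition_date (and sub_part) as columns
  .parquet(paths: _*)
  .write.parquet("foo.out")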
It looks like this is related:
- Spark lists all leaf node even in partitioned data (though that question is about S3)