4

We can use Autoloader to track which files have or have not been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded?

I can easily do this with an AWS Glue job bookmark, but I'm not aware of how to do this in Databricks Autoloader.

Herry
  • Can you share a reference to the AWS Glue job bookmark capability you are looking for? There is a lot of code added in a Glue job to do bookmarking; as you can see below, the Autoloader code is simple, only two statements. – akshat thakar Feb 09 '22 at 14:35

3 Answers

1
.load("path")
.withColumn("filePath",input_file_name())

Then you can, for example, write filePath to your stream sink and get the distinct values from there, or use foreach / foreachBatch to insert it into a Spark SQL table.
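For illustration, here is a minimal sketch of that approach in PySpark; the source format, bucket paths, and table name are assumptions, not part of the original answer:

from pyspark.sql.functions import input_file_name

# Auto Loader stream that records the source file path of every row
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")   # assumed source format
      .load("s3://my-bucket/landing/")       # assumed input path
      .withColumn("filePath", input_file_name()))

# Write to a Delta table; the checkpoint tracks which files were ingested
(df.writeStream
   .option("checkpointLocation", "s3://my-bucket/checkpoints/landing/")
   .toTable("bronze.landing"))

# Later: list the files that have been loaded
spark.sql("SELECT DISTINCT filePath FROM bronze.landing").show(truncate=False)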

Hubert Dudek
1

If you are using the checkpointLocation option, you can get all the files that were processed by reading the RocksDB logs. Below is some example code to achieve that; note that you need to point to the path inside the checkpoint location from which you want to retrieve the list of loaded files.

from glob import glob
import codecs

# The RocksDB logs live under <checkpoint>/sources/*/rocksdb/logs/
directory = "<YOUR_PATH_GOES_HERE>/sources/*/rocksdb/logs"
for file in glob(f"{directory}/*.log"):
    with codecs.open(file, encoding='utf-8', errors='ignore') as f:
        lines = f.readlines()
        print(lines)

P.S.: The logs need to be parsed to extract only the filenames; a sketch of one way to do that follows.
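One possible way to do that parsing, assuming the loaded file paths show up as plain s3:// URIs inside the log text (a sketch, not the documented log format):

import re
import codecs
from glob import glob

# Collect anything that looks like an S3 URI from the RocksDB log files
uri_pattern = re.compile(r"s3a?://[^\s\"']+")
loaded_files = set()

for file in glob("<YOUR_PATH_GOES_HERE>/sources/*/rocksdb/logs/*.log"):
    with codecs.open(file, encoding='utf-8', errors='ignore') as f:
        loaded_files.update(uri_pattern.findall(f.read()))

for path in sorted(loaded_files):
    print(path)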

0

There is a cloud_files_state SQL function that exposes this in Databricks Runtime 10.5+.

For example,

SELECT * FROM cloud_files_state('<checkpoint_location>')
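A PySpark equivalent, as a sketch; the checkpoint path is an assumption, and the returned table includes the path of each processed file along with other metadata columns:

# Query the Auto Loader checkpoint state; one row per discovered file
files_df = spark.sql(
    "SELECT * FROM cloud_files_state('s3://my-bucket/checkpoints/landing/')"
)
files_df.select("path").show(truncate=False)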
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/33260749) – Rich Nov 30 '22 at 20:27