Questions tagged [databricks-autoloader]
69 questions
7
votes
2 answers
Databricks Delta Live Tables - Apply Changes from delta table
I am working with Databricks Delta Live Tables, but am having trouble upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clearly as possible. Let me know if any parts are not clear.
I…
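For reference, a minimal sketch of the usual shape of such a pipeline with the DLT Python API's apply_changes; the table, key, and sequencing column names are hypothetical placeholders:

import dlt
from pyspark.sql.functions import col

# Target streaming table that APPLY CHANGES maintains.
dlt.create_streaming_table("customers_silver")

# Upsert rows from an upstream change feed into the target.
dlt.apply_changes(
    target="customers_silver",      # hypothetical target table
    source="customers_bronze",      # hypothetical upstream source
    keys=["customer_id"],
    sequence_by=col("updated_at"),  # ordering column for out-of-order events
    stored_as_scd_type=1,
)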

gamezone25
- 288
- 2
- 10
4
votes
3 answers
Get the list of loaded files from Databricks Autoloader
We can use Autoloader to track which files have been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded?
I can easily do this in AWS Glue job…
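On recent Databricks runtimes the file-discovery log can be queried straight from the stream's checkpoint with cloud_files_state; the checkpoint path below is a placeholder:

# One row per file Auto Loader has discovered/committed for this stream.
loaded = spark.sql(
    "SELECT * FROM cloud_files_state('s3://my-bucket/checkpoints/my_stream')"
)
loaded.show(truncate=False)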

Herry
- 455
- 3
- 14
3
votes
0 answers
Consume gzip files with Databricks Autoloader
I am currently unable to find a direct way to load .gz files via Autoloader. I can load the files as binary content, but I cannot extract the compressed xml files and process them further in a streaming way.
Therefore, I would like to know if there…
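One possible workaround, sketched under the assumption that each .gz holds a single UTF-8 xml document (paths are placeholders): read with the binaryFile format and decompress in a UDF:

import gzip
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def gunzip(data):
    # 'content' arrives as bytes via the binaryFile schema
    return gzip.decompress(bytes(data)).decode("utf-8")

raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "binaryFile")
       .load("/mnt/landing/"))

xml_text = raw.select(gunzip("content").alias("xml"))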

Reevus
- 69
- 7
3
votes
1 answer
Load files in order with Databricks autoloader
I'm trying to write a python pipeline in Databricks to take CDC data from a Postgres database, dumped by DMS into S3 as parquet files, and ingest it. The file names are numerically ascending unique IDs based on datetime (i.e. 20220630-215325970.csv). Right now…
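Auto Loader makes no global ordering guarantee; a common workaround (a sketch only, with placeholder paths) is to carry the source file name into each micro-batch and sort on it in foreachBatch, since these names sort lexically by timestamp:

from pyspark.sql.functions import input_file_name

def process_batch(df, batch_id):
    ordered = df.orderBy("source_file")  # lexical order == chronological here
    ordered.write.format("delta").mode("append").save("/mnt/silver/cdc")

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .load("s3://my-bucket/dms-output/")
          .withColumn("source_file", input_file_name()))

stream.writeStream.foreachBatch(process_batch).start()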

B. Bogart
- 998
- 6
- 15
3
votes
2 answers
Databricks cannot save stream checkpoint
I'm trying to set up the stream to begin processing incoming files. It looks like Databricks is unable to save a checkpoint. I tried locations in ADLS Gen2 and DBFS with the same result. Databricks creates the needed folder with some structure but cannot…
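For reference, a minimal write with an explicit checkpoint location (all paths are placeholders); if the cluster identity lacks write permission on that path, checkpoint creation can fail partway with exactly this kind of partially created folder:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "abfss://data@myaccount.dfs.core.windows.net/schema")
      .load("abfss://data@myaccount.dfs.core.windows.net/landing"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/checkpoints/mystream")
   .start("abfss://data@myaccount.dfs.core.windows.net/tables/mytable"))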

Vik Muzychko
- 51
- 6
2
votes
1 answer
Databricks Autoloader: use MAP() type as a schema hint
I am attempting to set up a readStream using autoloader in pyspark databricks:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("inferSchema", True) \
.option("cloudFiles.schemaLocation", schema_path) \
…
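cloudFiles.schemaHints accepts DDL-style type strings, so a map column can be hinted directly; a minimal sketch, assuming a hypothetical column named attributes (schema_path as in the question, input_path a placeholder):

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", schema_path)
      # hint the map column; the rest of the schema is still inferred
      .option("cloudFiles.schemaHints", "attributes MAP<STRING,STRING>")
      .load(input_path))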

wylie
- 173
- 11
2
votes
1 answer
How does streaming get triggered in Databricks with the File Notification option
How does the spark readStream code get triggered in Databricks AutoLoader?
I understand it is an event-driven process and a new-file notification causes the file to be consumed.
Should the below code be run as a job? If that's the case, how is…
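A sketch of the file-notification variant (paths and names are placeholders). The notification does not trigger the code itself; it only tells Auto Loader which new files exist, so the stream still runs as a normal structured streaming query, e.g. inside a scheduled or continuous job:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
      .load("/mnt/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)  # drain whatever is queued, then stop
   .toTable("bronze_events"))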

learner
- 833
- 3
- 13
- 24
2
votes
0 answers
Databricks Autoloader is reading files from subfolders without being explicitly asked to do so
I'm working with Azure Databricks where I am stream-reading files from an Azure Datalake container. I am using the autoloader functionality with .format("cloudFiles").
The file structure in the container looks like…
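Auto Loader, like Spark's file sources, recursively matches files under the input path by default; a glob on the load path is the usual way to pin discovery to one directory level (path is a placeholder):

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      # the glob matches only files directly under /landing, not subfolders
      .load("abfss://container@account.dfs.core.windows.net/landing/*.csv"))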

Milan Szabo
- 53
- 5
2
votes
1 answer
How to deal with invalid character(s) in column names when using databricks autoloader for csv?
I am attempting to set up a databricks autoloader stream to read a large number of csv files, however I get the error
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema, due to the .csv column names containing spaces.…
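One workaround (a sketch; raw_df stands for the streaming DataFrame returned by the cloudFiles read) is to rename the offending columns before writing to Delta, which is what rejects those characters:

import re

def sanitize_columns(df):
    # Replace every character Delta rejects in column names with '_'
    for c in df.columns:
        df = df.withColumnRenamed(c, re.sub(r"[ ,;{}()\n\t=]", "_", c))
    return df

clean_df = sanitize_columns(raw_df)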

FUUUUUUUVK
- 21
- 2
2
votes
1 answer
Azure Databricks: can Autoloader use AD credential passthrough?
I am unable to authenticate to ADLS Gen2 when using Autoloader. My Databricks cluster is enabled for AD credential passthrough, which allows the following read and write from ADLS Gen2.
filepath_read =…
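To my understanding (treat this as an assumption rather than a confirmed answer), directory-listing mode authenticates file reads with the cluster identity, so passthrough can apply there, while file-notification mode provisions queue resources and needs its own service credentials. A sketch with placeholder paths:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.useNotifications", "false")  # stay in directory-listing mode
      .option("cloudFiles.schemaLocation", "abfss://data@myaccount.dfs.core.windows.net/schema")
      .load("abfss://data@myaccount.dfs.core.windows.net/raw/"))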

Levi Huddleston
- 31
- 3
2
votes
1 answer
Databricks spark.readstream format differences
I am confused about the difference between the following code snippets in Databricks
spark.readStream.format('json')
vs
spark.readStream.format('cloudfiles').option('cloudFiles.format', 'json')
I know cloudfiles as the format would be regarded as…
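For context, the two shapes side by side; paths and event_schema are placeholders:

# Plain file-source stream: discovers new files by re-listing the input
# directory every micro-batch and requires an explicit schema.
plain = (spark.readStream.format("json")
         .schema(event_schema)
         .load("/mnt/landing"))

# Auto Loader: tracks already-ingested files in its checkpoint (optionally
# fed by storage notifications) and supports schema inference/evolution.
auto = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
        .load("/mnt/landing"))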

mytabi
- 639
- 2
- 12
- 28
2
votes
2 answers
Databricks Autoloader - Column Transformation - Column is not iterable
I am using Azure Databricks Autoloader to process files from ADLS Gen 2 into Delta Lake. I have written my foreachBatch function (pyspark) in the following manner:
#Rename incoming dataframe columns
schemadf =…
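"Column is not iterable" usually means a pyspark Column object was looped over where a list of names was intended; df.columns is a plain list of strings and is safe to iterate. A sketch of a working foreachBatch rename, with hypothetical names (stream stands for the Auto Loader readStream):

def rename_and_write(batch_df, batch_id):
    for old_name in batch_df.columns:  # list of str, safe to iterate
        batch_df = batch_df.withColumnRenamed(old_name, old_name.strip().lower())
    batch_df.write.format("delta").mode("append").save("/mnt/silver/table")

stream.writeStream.foreachBatch(rename_and_write).start()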

bunker
- 99
- 10
2
votes
1 answer
How to solve Error of offset mismatch in Azure Databricks Autoloader cloudfiles source?
This happens when some files are deleted from the data source that the Autoloader stream is reading from.
try:
    raw_df = spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv") \
        …
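A common remedy (an assumption from the symptom, not a confirmed fix) is to restart the stream from a fresh checkpoint so it re-discovers only the files that still exist; note this reprocesses data, so deduplicate downstream. Paths and names are placeholders:

raw_df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema_v2")
          .load("/mnt/landing"))

(raw_df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/stream_v2")  # new location
   .toTable("bronze_raw"))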

SWATHI.M SHETTY
- 21
- 1
2
votes
1 answer
Handling Duplicates in Databricks autoloader
I am new to Databricks Autoloader. We have a requirement to process data from AWS S3 into a Delta table via Databricks Autoloader. While testing the Autoloader I came across a duplicate issue: if I upload a file with name…
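A common pattern for this is an idempotent upsert in foreachBatch, merging each micro-batch on a business key so that a re-uploaded file cannot create duplicates; table and column names below are hypothetical (stream stands for the Auto Loader readStream):

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Merge on the business key; drop in-batch duplicates first
    target = DeltaTable.forName(spark, "bronze.orders")
    (target.alias("t")
       .merge(batch_df.dropDuplicates(["order_id"]).alias("s"),
              "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

stream.writeStream.foreachBatch(upsert_batch).start()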

MykG
- 109
- 1
- 11
2
votes
2 answers
Can we exclude or include only particular file extensions from Databricks Autoloader?
Right now the databricks autoloader requires a directory path where all the files will be loaded from. But in case some other kind of log files also start coming into that directory - is there a way to ask Autoloader to exclude those files while…
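The pathGlobFilter option restricts Auto Loader to matching file names, so stray files in the same directory are ignored (path is a placeholder):

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("pathGlobFilter", "*.json")  # only ingest .json files
      .load("s3://my-bucket/landing/"))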

Keshav Agrawal
- 577
- 9
- 23