Questions tagged [databricks-autoloader]
69 questions
7
votes
2 answers
Databricks Delta Live Tables - Apply Changes from delta table
I am working with Databricks Delta Live Tables, but am having trouble upserting some tables upstream. I know it is quite a long text below, but I tried to describe my problem as clearly as possible. Let me know if any parts are not clear.
I…
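For reference, a minimal sketch of the usual shape of such a pipeline with the DLT Python API's apply_changes; the table, key, and sequencing column names are hypothetical placeholders:

import dlt
from pyspark.sql.functions import col

# Target streaming table that APPLY CHANGES maintains.
dlt.create_streaming_table("customers_silver")

# Upsert rows from an upstream change feed into the target.
dlt.apply_changes(
    target="customers_silver",      # hypothetical target table
    source="customers_bronze",      # hypothetical upstream source
    keys=["customer_id"],
    sequence_by=col("updated_at"),  # ordering column for out-of-order events
    stored_as_scd_type=1,
)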

gamezone25
- 288
- 2
- 10
4
votes
3 answers
Get the list of loaded files from Databricks Autoloader
We can use Autoloader to track which files have been loaded from an S3 bucket. My question about Autoloader: is there a way to read the Autoloader database to get the list of files that have been loaded?
I can easily do this in AWS Glue job…
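On recent Databricks runtimes the file-discovery log can be queried straight from the stream's checkpoint with cloud_files_state; the checkpoint path below is a placeholder:

# One row per file Auto Loader has discovered/committed for this stream.
loaded = spark.sql(
    "SELECT * FROM cloud_files_state('s3://my-bucket/checkpoints/my_stream')"
)
loaded.show(truncate=False)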

Herry
- 455
- 3
- 14
3
votes
0 answers
Consume gzip files with Databricks Autoloader
I am currently unable to find a direct way to load .gz files via Autoloader. I can load the files as binary content, but I cannot extract the compressed xml files and process them further in a streaming way.
Therefore, I would like to know if there…
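One possible workaround, sketched under the assumption that each .gz holds a single UTF-8 xml document (paths are placeholders): read with the binaryFile format and decompress in a UDF:

import gzip
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def gunzip(data):
    # 'content' arrives as bytes via the binaryFile schema
    return gzip.decompress(bytes(data)).decode("utf-8")

raw = (spark.readStream.format("cloudFiles")
       .option("cloudFiles.format", "binaryFile")
       .load("/mnt/landing/"))

xml_text = raw.select(gunzip("content").alias("xml"))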

Reevus
- 69
- 7
3
votes
1 answer
Load files in order with Databricks autoloader
I'm trying to write a python pipeline in Databricks to take CDC data from a Postgres database, dumped by DMS into S3 as parquet files, and ingest it. The file names are numerically ascending unique IDs based on datetime (i.e. 20220630-215325970.csv). Right now…
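Auto Loader makes no global ordering guarantee; a common workaround (a sketch only, with placeholder paths) is to carry the source file name into each micro-batch and sort on it in foreachBatch, since these names sort lexically by timestamp:

from pyspark.sql.functions import input_file_name

def process_batch(df, batch_id):
    ordered = df.orderBy("source_file")  # lexical order == chronological here
    ordered.write.format("delta").mode("append").save("/mnt/silver/cdc")

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "parquet")
          .load("s3://my-bucket/dms-output/")
          .withColumn("source_file", input_file_name()))

stream.writeStream.foreachBatch(process_batch).start()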

B. Bogart
- 998
- 6
- 15
3
votes
2 answers
Databricks cannot save stream checkpoint
I'm trying to set up the stream to begin processing incoming files. It looks like Databricks is unable to save a checkpoint. I tried locations in ADLS Gen2 and DBFS with the same result. Databricks creates the needed folder with some structure but cannot…
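For reference, a minimal write with an explicit checkpoint location (all paths are placeholders); if the cluster identity lacks write permission on that path, checkpoint creation can fail partway with exactly this kind of partially created folder:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "abfss://data@myaccount.dfs.core.windows.net/schema")
      .load("abfss://data@myaccount.dfs.core.windows.net/landing"))

(df.writeStream
   .format("delta")
   .option("checkpointLocation", "abfss://data@myaccount.dfs.core.windows.net/checkpoints/mystream")
   .start("abfss://data@myaccount.dfs.core.windows.net/tables/mytable"))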

Vik Muzychko
- 51
- 6
2
votes
1 answer
Databricks Autoloader: use MAP() type as a schema hint
I am attempting to set up a readStream using autoloader in pyspark databricks:
spark.readStream.format("cloudFiles") \
.option("cloudFiles.format", "csv") \
.option("inferSchema", True) \
.option("cloudFiles.schemaLocation", schema_path) \
…
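cloudFiles.schemaHints accepts DDL-style type strings, so a map column can be hinted directly; a minimal sketch, assuming a hypothetical column named attributes (schema_path as in the question, input_path a placeholder):

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.schemaLocation", schema_path)
      # hint the map column; the rest of the schema is still inferred
      .option("cloudFiles.schemaHints", "attributes MAP<STRING,STRING>")
      .load(input_path))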

wylie
- 173
- 11
2
votes
1 answer
How does streaming get triggered in Databricks with the File Notification option
How does the spark readStream code get triggered in Databricks AutoLoader?
I understand it is an event-driven process and a new-file notification causes the file to be consumed.
Should the below code be run as a job? If that's the case, how is…
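A sketch of the file-notification variant (paths and names are placeholders). The notification does not trigger the code itself; it only tells Auto Loader which new files exist, so the stream still runs as a normal structured streaming query, e.g. inside a scheduled or continuous job:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.useNotifications", "true")
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
      .load("/mnt/landing/events"))

(df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/events")
   .trigger(availableNow=True)  # drain whatever is queued, then stop
   .toTable("bronze_events"))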

learner
- 833
- 3
- 13
- 24
2
votes
0 answers
Databricks Autoloader is reading files from subfolders without being explicitly asked to do so
I'm working with Azure Databricks where I am stream-reading files from an Azure Datalake container. I am using the autoloader functionality with .format("cloudFiles").
The file structure in the container looks like…
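Auto Loader, like Spark's file sources, recursively matches files under the input path by default; a glob on the load path is the usual way to pin discovery to one directory level (path is a placeholder):

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      # the glob matches only files directly under /landing, not subfolders
      .load("abfss://container@account.dfs.core.windows.net/landing/*.csv"))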

Milan Szabo
- 53
- 5
2
votes
1 answer
How to deal with invalid character(s) in column names when using databricks autoloader for csv?
I am attempting to set up a databricks autoloader stream to read a large number of csv files, however I get the error
Found invalid character(s) among " ,;{}()\n\t=" in the column names of your schema, due to the .csv column names containing spaces.…
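One workaround (a sketch; raw_df stands for the streaming DataFrame returned by the cloudFiles read) is to rename the offending columns before writing to Delta, which is what rejects those characters:

import re

def sanitize_columns(df):
    # Replace every character Delta rejects in column names with '_'
    for c in df.columns:
        df = df.withColumnRenamed(c, re.sub(r"[ ,;{}()\n\t=]", "_", c))
    return df

clean_df = sanitize_columns(raw_df)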

FUUUUUUUVK
- 21
- 2
2
votes
1 answer
Azure Databricks: can Autoloader use AD credential passthrough?
I am unable to authenticate to ADLS Gen2 when using Autoloader. My Databricks cluster is enabled for AD credential passthrough, which allows the following read and write from ADLS Gen2.
filepath_read =…
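To my understanding (treat this as an assumption rather than a confirmed answer), directory-listing mode authenticates file reads with the cluster identity, so passthrough can apply there, while file-notification mode provisions queue resources and needs its own service credentials. A sketch with placeholder paths:

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.useNotifications", "false")  # stay in directory-listing mode
      .option("cloudFiles.schemaLocation", "abfss://data@myaccount.dfs.core.windows.net/schema")
      .load("abfss://data@myaccount.dfs.core.windows.net/raw/"))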

Levi Huddleston
- 31
- 3
2
votes
1 answer
Databricks spark.readstream format differences
I am confused about the difference between the following code snippets in Databricks
spark.readStream.format('json')
vs
spark.readStream.format('cloudfiles').option('cloudFiles.format', 'json')
I know cloudfiles as the format would be regarded as…
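For context, the two shapes side by side; paths and event_schema are placeholders:

# Plain file-source stream: discovers new files by re-listing the input
# directory every micro-batch and requires an explicit schema.
plain = (spark.readStream.format("json")
         .schema(event_schema)
         .load("/mnt/landing"))

# Auto Loader: tracks already-ingested files in its checkpoint (optionally
# fed by storage notifications) and supports schema inference/evolution.
auto = (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
        .load("/mnt/landing"))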

mytabi
- 639
- 2
- 12
- 28
2
votes
2 answers
Databricks Autoloader - Column Transformation - Column is not iterable
I am using Azure Databricks Autoloader to process files from ADLS Gen 2 into Delta Lake. I have written my foreachBatch function (pyspark) in the following manner:
#Rename incoming dataframe columns
schemadf =…
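"Column is not iterable" usually means a pyspark Column object was looped over where a list of names was intended; df.columns is a plain list of strings and is safe to iterate. A sketch of a working foreachBatch rename, with hypothetical names (stream stands for the Auto Loader readStream):

def rename_and_write(batch_df, batch_id):
    for old_name in batch_df.columns:  # list of str, safe to iterate
        batch_df = batch_df.withColumnRenamed(old_name, old_name.strip().lower())
    batch_df.write.format("delta").mode("append").save("/mnt/silver/table")

stream.writeStream.foreachBatch(rename_and_write).start()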

bunker
- 99
- 10
2
votes
1 answer
How to solve Error of offset mismatch in Azure Databricks Autoloader cloudfiles source?
This happens when some files are deleted from the data source that the Autoloader stream is reading from.
try:
    raw_df = spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv") \
        …
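A common remedy (an assumption from the symptom, not a confirmed fix) is to restart the stream from a fresh checkpoint so it re-discovers only the files that still exist; note this reprocesses data, so deduplicate downstream. Paths and names are placeholders:

raw_df = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema_v2")
          .load("/mnt/landing"))

(raw_df.writeStream
   .option("checkpointLocation", "/mnt/checkpoints/stream_v2")  # new location
   .toTable("bronze_raw"))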

SWATHI.M SHETTY
- 21
- 1
2
votes
1 answer
Handling Duplicates in Databricks autoloader
I am new to Databricks Autoloader. We have a requirement to process data from AWS S3 into a Delta table via Databricks Autoloader. While testing the Autoloader I came across a duplicate issue: if I upload a file with name…
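A common pattern for this is an idempotent upsert in foreachBatch, merging each micro-batch on a business key so that a re-uploaded file cannot create duplicates; table and column names below are hypothetical (stream stands for the Auto Loader readStream):

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Merge on the business key; drop in-batch duplicates first
    target = DeltaTable.forName(spark, "bronze.orders")
    (target.alias("t")
       .merge(batch_df.dropDuplicates(["order_id"]).alias("s"),
              "t.order_id = s.order_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

stream.writeStream.foreachBatch(upsert_batch).start()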

MykG
- 109
- 1
- 11
2
votes
2 answers
Can we exclude or include only particular file extensions from Databricks Autoloader?
Right now the databricks autoloader requires a directory path where all the files will be loaded from. But in case some other kind of log files also start coming into that directory - is there a way to ask Autoloader to exclude those files while…
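The pathGlobFilter option restricts Auto Loader to matching file names, so stray files in the same directory are ignored (path is a placeholder):

df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("pathGlobFilter", "*.json")  # only ingest .json files
      .load("s3://my-bucket/landing/"))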

Keshav Agrawal
- 577
- 9
- 23