This is my first question ever so thanks in advance for answering me.

I want to create an external table with Spark in Azure Databricks. I already have the data in my ADLS; it is extracted automatically from different sources every day. The folder structure on the storage is like ../filename/date/file.parquet.

I do not want to duplicate the files by saving a copy of them in another folder/container.

My problem is that I want to add a date column, extracted from the folder path, to the table without copying or changing the source files.

I am using Spark SQL to create the table.

CREATE TABLE IF NOT EXISTS my_ext_tbl
USING parquet
OPTIONS (path "/mnt/some-dir/source_files/")
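
This registers fine, but the table only exposes the columns physically present in the parquet files; a quick check (using the table defined above):

spark.sql("SELECT * FROM my_ext_tbl LIMIT 5").show()
# -> only the file columns appear; the date from the folder path is lost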

Is there a proper way to add such a column in one easy and readable step, or do I have to read the raw data into a DataFrame, add the column, and then save it as an external table in a different location (roughly as in the sketch below)?
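
For reference, a minimal sketch of that multi-step approach, assuming the date folder follows a yyyy-MM-dd pattern (load_date, the regex, and the target path are just illustrative names):

from pyspark.sql.functions import input_file_name, regexp_extract, to_date

# Read the raw files; input_file_name() exposes the source file path per row.
df = spark.read.parquet("/mnt/some-dir/source_files/")

# Pull the date folder out of the path, e.g. .../source_files/2021-05-01/file.parquet
df = df.withColumn(
    "load_date",
    to_date(regexp_extract(input_file_name(), r"/(\d{4}-\d{2}-\d{2})/", 1))
)

# Persist to a second location and register an external table over it --
# exactly the duplication I am trying to avoid.
df.write.mode("overwrite").parquet("/mnt/some-dir/source_files_with_date/")
spark.sql("""
  CREATE TABLE IF NOT EXISTS my_ext_tbl_with_date
  USING parquet
  OPTIONS (path "/mnt/some-dir/source_files_with_date/")
""")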

I am aware that unmanaged tables store only metadata in DBFS. However, I am wondering whether this is even possible.

Hope it's clear.

EDIT: Since it seems there is no viable solution without copying or interfering with the source files, I would like to ask: how are you handling such challenges?

EDIT2: I think that link might provide a solution. The difference in my case is that the date inside the folder path is not a real partition; it is just a date added during the pipeline that extracts data from the external source (illustrated below).
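
For clarity on that difference: Spark's partition discovery only infers columns from key=value folder names. A minimal sketch, assuming a hypothetical date=... layout (paths are illustrative):

# This layout WOULD let Spark infer a 'date' column via partition discovery:
#   /mnt/some-dir/source_files/date=2021-05-01/file.parquet
df = (spark.read
      .option("basePath", "/mnt/some-dir/source_files/")
      .parquet("/mnt/some-dir/source_files/date=*/"))
# df.printSchema() would then include 'date', inferred from the folder names.

# My actual layout has plain date folders with no key= prefix,
# e.g. /mnt/some-dir/source_files/2021-05-01/file.parquet, so nothing is inferred.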
