I am trying to read a file from ADLS Gen2 in Synapse and want to authenticate with the account key.
According to the docs, the following should work, but it doesn't in Synapse:
spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)
I want to use the ABFS driver as the docs suggest:
Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.
What does not work:
- When I use PySpark + ABFS and execute it in a Synapse notebook, I get the following error:
java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403
What works:
- When I use PySpark + WASBS and execute it in a Synapse notebook, it works.
- When I use PySpark + ABFS and execute it locally from PyCharm, it works.
- When I use Python with DataLakeServiceClient in a Synapse notebook, it works.
- When I use Python with DataLakeServiceClient locally from PyCharm, it works.
Since the same account key works in every other combination, this is not a problem of missing permissions but something specific to Synapse. Am I missing some configuration? Any help is appreciated. I'd rather not fall back to the WASB API, since (according to this post) ABFS is the driver that should be used for ADLS Gen2.
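The only additional setting I'm aware of is the explicit auth type from the Hadoop ABFS documentation; I haven't verified whether Synapse needs it on top of the key itself, so this is just a sketch of what I would try next (variable names are the ones defined below):
# Sketch only: explicitly declare SharedKey auth, as described in the Hadoop ABFS docs.
# I have not confirmed that Synapse requires this in addition to setting the key.
spark.conf.set(f"fs.azure.account.auth.type.{adls_account_name}.dfs.core.windows.net", "SharedKey")
spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)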
Every snippet below uses the following variables:
adls_account_key = "<myaccountkey>"
adls_container_name = "<mycontainername>"
adls_account_name = "<myaccountname>"
filepath = "/Data/Contacts"
Synapse PySpark ABFS code (fails with the 403 above):
spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)
base_path = f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net"
df = spark.read.parquet(base_path + filepath)
df.show(10, False)
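For reference, a minimal sanity check one can run in the notebook right before the read to confirm the key property is actually present in the session config (just a debugging sketch; it prints None if the property is missing):
# Debugging sketch: verify the account key landed in the session config.
key_prop = f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net"
print(spark.conf.get(key_prop, None))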
Synapse PySpark WASBS code (works):
spark.conf.set(f"fs.azure.account.key.{adls_account_name}.blob.core.windows.net", adls_account_key)
base_path = f"wasbs://{adls_container_name}@{adls_account_name}.blob.core.windows.net"
df = spark.read.parquet(base_path + filepath)
df.show(10, False)
Python DataLakeServiceClient code, identical on Synapse and locally (works in both):
from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url=f"https://{adls_account_name}.dfs.core.windows.net",
    credential=adls_account_key,
)
file_client = service_client.get_file_client(
    file_system=adls_container_name, file_path=filepath
)
file_content = file_client.download_file().readall()
Local PySpark ABFS code, which builds its own Spark session but is otherwise the exact same code (works):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
    .getOrCreate()
spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)
base_path = f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net"
df = spark.read.parquet(base_path + filepath)
df.show(10, False)
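Locally I can also pass the key at session-build time via the spark.hadoop. prefix, which is the standard way to forward a Hadoop property through the Spark conf. This is only an alternative to the spark.conf.set call above, and I don't expect it to change anything in Synapse; it's just a sketch for completeness:
# Alternative sketch: set the same Hadoop property when the session is built,
# using the spark.hadoop. prefix instead of calling spark.conf.set afterwards.
spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
    .config(f'spark.hadoop.fs.azure.account.key.{adls_account_name}.dfs.core.windows.net', adls_account_key) \
    .getOrCreate()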