
I am trying to read a file from ADLS Gen2 in Synapse and want to authenticate with the account key.

According to the docs, the following should work, but it doesn't in Synapse:

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)

I want to use the ABFS driver as the docs suggest:

Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.

What does not work:

  • When I use pyspark+ABFS and execute in Synapse Notebook, I get a java.nio.file.AccessDeniedException: Operation failed: "This request is not authorized to perform this operation using this permission.", 403 error.

What works:

  • When I use pyspark+WASBS and execute in Synapse Notebook, it works.
  • When I use pyspark+ABFS and execute locally from my local PyCharm, it works.
  • When I use python/DataLakeServiceClient in Synapse, it works.
  • When I use python/DataLakeServiceClient from my local PyCharm, it works.

It is definitely not a problem of missing permissions but a problem with Synapse. Am I missing some configurations? Any help is appreciated. I'd rather not use the WASB API as (according to this post) ABFS should be used for ADLSGen2.

Every code sample uses the following variables:

adls_account_key = "<myaccountkey>"
adls_container_name = "<mycontainername>"
adls_account_name = "<myaccountname>"
filepath = "/Data/Contacts"

Synapse PySpark ABFS code: (crashes)

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)    
base_path = f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net"
df = spark.read.parquet(base_path + filepath)
df.show(10, False)

Synapse PySpark WASBS code: (works)

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.blob.core.windows.net", adls_account_key)    
base_path = f"wasbs://{adls_container_name}@{adls_account_name}.blob.core.windows.net"
df = spark.read.parquet(base_path + filepath)
df.show(10, False)

Synapse + local Python/DataLakeServiceClient code (same on Synapse as on local): (works)

from azure.storage.filedatalake import DataLakeServiceClient

service_client = DataLakeServiceClient(
    account_url=f"https://{adls_account_name}.dfs.core.windows.net",
    credential=adls_account_key,
)
file_client = service_client.get_file_client(
    file_system=adls_container_name, file_path=filepath
)
file_content = file_client.download_file().readall()

Local pyspark ABFS code (includes building a spark session but otherwise the exact same code): (works)

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1') \
    .getOrCreate()

spark.conf.set(f"fs.azure.account.key.{adls_account_name}.dfs.core.windows.net", adls_account_key)

base_path = f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net"    
df = spark.read.parquet(base_path + filepath)
df.show(10, False)
Cribber

2 Answers


This post was helpful.

Apparently Synapse restricts ABFS so that it cannot authenticate with the account key. Instead, Synapse only allows authentication via linked services or a service principal. This explains why I can access the ADLS with ABFS from my local PyCharm, but not from within a Synapse notebook.
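
For reference, a minimal sketch of what the service-principal (OAuth) route looks like in Spark config, using the standard hadoop-azure ABFS settings; tenant_id, client_id and client_secret are placeholders for your own Azure AD app registration, not values from this post:

# Sketch: OAuth/service-principal authentication for ABFS.
# tenant_id, client_id and client_secret are placeholder variables.
spark.conf.set(f"fs.azure.account.auth.type.{adls_account_name}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{adls_account_name}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{adls_account_name}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{adls_account_name}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{adls_account_name}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")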

This basically locks any code utilizing ABFS into the Synapse infrastructure: it can no longer execute locally unless you are willing to write all authentication twice (once locally with the account key, once via Synapse's mandated service-principal route).
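
One way to live with the double authentication is a small helper that branches on the environment. This is a hypothetical sketch; the RUNNING_LOCALLY flag and the helper name are my own, not part of any API:

# Hypothetical helper: account key locally, Synapse's linked-service /
# service-principal route inside the workspace. Environment detection is an assumption.
import os

def configure_adls_auth(spark, account_name, account_key=None):
    if os.environ.get("RUNNING_LOCALLY"):
        # Local run: the account key is sufficient.
        spark.conf.set(f"fs.azure.account.key.{account_name}.dfs.core.windows.net", account_key)
    # Inside Synapse: set nothing here and rely on the linked service / service principal.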

On the other hand, you can use WASB to read from ADLS Gen2, but not to write to it, because of how rename operations are implemented on ADLS Gen2.

Needless to say, this is beyond stupid. With either protocol I am either locked into Synapse or I cannot write from local. To make my code run both locally and in Synapse I would need WASB + account-key auth for reads and ABFS + account key (local) + linked service (Synapse) for authentication. The major issue with the latter is that ABFS + linked services are a pain and extremely unstable in Synapse, see this post.
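
For completeness, the linked-service route in a Synapse notebook looks roughly like this, going by the config keys in Microsoft's Synapse docs; linked_service_name is a placeholder for a linked service configured in the workspace:

# Sketch: token provider backed by a Synapse linked service.
spark.conf.set("spark.storage.synapse.linkedServiceName", linked_service_name)
spark.conf.set("fs.azure.account.oauth.provider.type",
               "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedTokenProvider")
df = spark.read.parquet(f"abfss://{adls_container_name}@{adls_account_name}.dfs.core.windows.net{filepath}")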

Until Microsoft decides to let people use ABFS + the account key for authentication in Synapse I really am stuck between two incredibly stupid and limiting solutions.

Cribber

You are receiving this due to a lack of permissions. When creating the Synapse workspace, it does state that additional user access roles need to be assigned. You must be assigned the Storage Blob Data Contributor role on the storage account in order to access ADLS from the workspace.


Here are the steps to grant permissions to the managed identity in a Synapse workspace.
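
As an illustration, the role can be assigned with the Azure CLI; the assignee object ID and the resource IDs in the scope are placeholders you would replace with your own:

az role assignment create \
  --assignee "<principal-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"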


SwethaKandikonda
  • How can this possibly be a permission problem if it works outside of Synapse? Especially as it works with WASBS but not with ABFS? Oh and **I have contributor role for the entire subscription and also added the Synapse service principal to the ADLS.** So this is not it. Anyway, I'm using the account key, why on earth should I need extra permissions?? – Cribber Feb 24 '22 at 07:23
  • Even though you have Contributor for your entire subscription, roles such as Owner, Contributor, and Storage Account Contributor permit a security principal to manage a storage account, but they do not provide access to the blob or queue data within that account. – SwethaKandikonda Feb 24 '22 at 09:05
  • Please try checking these attached images for better understanding https://i.imgur.com/FBPNSNh.png , https://i.imgur.com/Zihvcs8.png – SwethaKandikonda Feb 24 '22 at 09:08
  • again... I can access the storage from outside of Azure (local PyCharm IDE) **solely with the account key**, no roles, no additional stuff - why should I suddenly need specific roles when executing the same code from Synapse? – Cribber Feb 24 '22 at 09:16
  • The ABFS and ABFSS schemes target the ADLS Gen2 REST API, so when they rely on a secret that is rotated, expires, or is deleted, errors such as 401 Unauthorized can occur. The WASB and WASBS schemes target the Azure Blob Storage REST API; you may use them to access blobs from any HDFS client, including HDInsight. Looking at the local PySpark ABFS code, you are not just using the base path (i.e., f"abfs://{adls_container_name}@{adls_account_name}.dfs.core.windows.net") but adding configurations too. – SwethaKandikonda Feb 24 '22 at 09:44
  • Hope this document would help you understand a bit more [Access Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-sp-access#---mount-adls-gen2-storage) on how to mount abfss scheme else you can add the role to mount it directly. – SwethaKandikonda Feb 24 '22 at 09:44
  • I am adding the same configurations that I set in my local PySpark ABFS code in the Synapse ABFS code as well. That is why I am so confused: why should the connection work differently just because I execute the code on Synapse? The only difference between local and Synapse is that I need to build a SparkSession locally, while Synapse supplies me with one. The document you linked uses a service principal as the credentials, but I explicitly want to use the *account key*! – Cribber Feb 24 '22 at 09:59
  • Seems more like this is a bug in Synapse that I cannot authenticate with the account key alone via ABFS.... – Cribber Feb 24 '22 at 10:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/242356/discussion-between-swethakandikonda-mt-and-cribber). – SwethaKandikonda Feb 24 '22 at 10:23