
I am trying to mount ADLS Gen 2 from my Databricks Community Edition workspace, but when I run the following code:

test = spark.read.csv("/mnt/lake/RAW/csds.csv", inferSchema=True, header=True)

I get the error:

com.databricks.rpc.UnknownRemoteException: Remote exception occurred:

I'm using the following code to mount ADLS Gen 2:

# Returns 1 if mntPoint is already mounted, 0 otherwise
def check(mntPoint):
  a = []
  for test in dbutils.fs.mounts():
    a.append(test.mountPoint)
  result = a.count(mntPoint)
  return result

mount = "/mnt/lake"

if check(mount)==1:
  resultMsg = "<div>%s is already mounted. </div>" % mount
else:
  dbutils.fs.mount(
    source = "wasbs://root@adlspretbiukadlsdev.blob.core.windows.net",
    mount_point = mount,
    extra_configs = {"fs.azure.account.key.adlspretbiukadlsdev.blob.core.windows.net": ""})
  resultMsg = "<div>%s was mounted. </div>" % mount

displayHTML(resultMsg)


ServicePrincipalID = 'xxxxxxxxxxx'
ServicePrincipalKey = 'xxxxxxxxxxxxxx'
DirectoryID =  'xxxxxxxxxxxxxxx'
Lake =  'adlsgen2'


# Build the OAuth token endpoint URL from the DirectoryID
Directory = "https://login.microsoftonline.com/{}/oauth2/token".format(DirectoryID)

# Create configurations for our connection
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": ServicePrincipalID,
           "fs.azure.account.oauth2.client.secret": ServicePrincipalKey,
           "fs.azure.account.oauth2.client.endpoint": Directory}



mount = "/mnt/lake"

if check(mount)==1:
  resultMsg = "<div>%s is already mounted. </div>" % mount
else:
  dbutils.fs.mount(
    source = f"abfss://root@{Lake}.dfs.core.windows.net/",
    mount_point = mount,
    extra_configs = configs)
  resultMsg = "<div>%s was mounted. </div>" % mount

I then try to read a DataFrame from ADLS Gen 2 using the following:

dataPath = "/mnt/lake/RAW/DummyEventData/CommerceTools/"

test = spark.read.csv("/mnt/lake/RAW/csds.csv", inferSchema=True, header=True)

but I get the same error:

com.databricks.rpc.UnknownRemoteException: Remote exception occurred:

Any ideas?

  • please post the whole stacktrace – Alex Ott May 16 '21 at 09:07
  • Hi @AlexOtt, do you mean ```/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client) 325 if answer[1] == REFERENCE_TYPE: --> 326 raise Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". 328 format(target_id, ".", name), value)``` – Patterson May 16 '21 at 09:13
  • Or did you mean ``` Py4JJavaError Traceback (most recent call last) in ----> 1 test = spark.read.csv("/mnt/lake/RAW/csds.csv", inferSchema=True, header=True) /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, sep, encoding, quote, escape, comment, 762 path = [path] 763 if type(path) == list: --> 764 return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) ``` – Patterson May 16 '21 at 09:16
  • yes, include the error message from JVM - usually there should be a line, "Caused by" – Alex Ott May 16 '21 at 09:19
  • I suspect that it could be caused by the security model of the community edition that is different from the "normal" Databricks – Alex Ott May 16 '21 at 09:20
  • Hi @AlexOtt, that is what I was thinking. But I wanted to know for sure, before I started troubleshooting – Patterson May 16 '21 at 09:29
  • @AlexOtt, just so you know, there isn't a line "Caused by" – Patterson May 16 '21 at 09:45
  • Anyway it's hard to say without full stacktrace – Alex Ott May 16 '21 at 11:09
  • Hi @AlexOtt, I'm not sure how to get you the full stacktrace? SO will only allow a certain number of characters – Patterson May 17 '21 at 12:23
  • Put it into https://gist.github.com or something like and link it from post – Alex Ott May 17 '21 at 12:26
  • Hi @AlexOtt, I have done this before with github, but here you https://gist.github.com/cpatte7372/f9a820e82c5e57befa919430b1b9af45 Let me know if you can access it? Thanks – Patterson May 18 '21 at 10:23
  • @AlexOtt, so I assigned the Service Principal the Storage Blob Data Contributor role. Now, I'm able to read in the CSV using: ```test2 = spark.read.csv("abfss://root@adlspretbiukadlsdev.dfs.core.windows.net/RAW/csds.csv",inferSchema=True,header=True)``` But I'm still getting the error when reading in the same CSV with: ```test = spark.read.csv("/mnt/lake/RAW/csds.csv", inferSchema=True, header=True)``` – Patterson May 18 '21 at 12:12
  • @AlexOtt, I re-added the code to https://gist.github.com/cpatte7372/f9a820e82c5e57befa919430b1b9af45 again just in case you have to check it out – Patterson May 18 '21 at 12:19
  • Hi @AlexOtt, did you get a chance to take another look at the code? – Patterson May 18 '21 at 20:07
  • I don’t know exactly, but I suspect something specific to community edition. I suggest just use full abfss url instead of mount - community edition isn’t the same as standard databricks – Alex Ott May 18 '21 at 20:58
  • @AlexOtt thats what I thought. Thanks – Patterson May 19 '21 at 08:03

1 Answer


Based on the stacktrace, the most probable reason for that error is that you don't have the Storage Blob Data Contributor (or Storage Blob Data Reader) role assigned to your service principal (as described in the documentation). This role is different from the usual "Contributor" role, which is very confusing.
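
As suggested in the comments, on Community Edition it can be simpler to skip the mount entirely and read over abfss directly once the role is assigned. Below is a minimal sketch (not part of the original answer) using per-session OAuth settings; the account name adlspretbiukadlsdev and the credential variables are reused from the question and comments and should be adjusted to your environment:

# Sketch: per-session OAuth configuration for direct abfss access
# (account name and credential variables assumed from the question/comments)
storageAccount = "adlspretbiukadlsdev"

spark.conf.set(f"fs.azure.account.auth.type.{storageAccount}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storageAccount}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storageAccount}.dfs.core.windows.net", ServicePrincipalID)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storageAccount}.dfs.core.windows.net", ServicePrincipalKey)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storageAccount}.dfs.core.windows.net",
               "https://login.microsoftonline.com/{}/oauth2/token".format(DirectoryID))

# Read the same CSV directly, bypassing the mount
test = spark.read.csv(f"abfss://root@{storageAccount}.dfs.core.windows.net/RAW/csds.csv",
                      inferSchema=True, header=True)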
