
I am new to Azure Databricks and am trying to create an external table pointing to an Azure Data Lake Storage (ADLS) Gen2 location.

From a Databricks notebook I have tried to set the Spark configuration for ADLS access, but I am still unable to execute the DDL below.

Note: One solution that works for me is mounting the ADLS account to the cluster and then using the mount location in the external table's DDL. But I need to check whether it is possible to create the external table DDL with the ADLS path directly, without a mount location.
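
For reference, the mount-based approach that does work for me looks roughly like this (a minimal sketch; the container, account, secret values and mount point are placeholders):

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "client_id",
  "fs.azure.account.oauth2.client.secret": "client_secret",
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/tenant_id/oauth2/token"
}

# Mount the container once; the mount point can then be used in the table's LOCATION clause
dbutils.fs.mount(
  source = "abfss://container@account_name.dfs.core.windows.net/",
  mount_point = "/mnt/employee",
  extra_configs = configs
)

With the mount in place, the DDL can use location '/mnt/employee/dev/data/employee'. What I tried for the direct (non-mount) approach is below.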

# Using Principal credentials
spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint", 
"https://login.microsoftonline.com/tenant_id/oauth2/token")

DDL

create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'

Error Received

Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);

I need help in knowing whether it is possible to refer to the ADLS location directly in the DDL.

Thanks.

anurag
  • Have you verified that all your values for tenant_id, client_id and client_secret are correct and that the service principal has the required permissions? – silent Jun 27 '19 at 14:09
  • Yes, because after setting up the Spark configurations I am able to read the file into a DataFrame and use it. – anurag Jun 27 '19 at 14:51
  • Looks like the only way is to mount the ADLS Gen2 account. – anurag Jun 27 '19 at 20:39
  • Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.); – anurag Jul 08 '19 at 11:38

1 Answer


Sort of, if you can use Python (or Scala).

Start by making the connection:

TenantID = "blah"

def connectLake():
  spark.conf.set("fs.azure.account.auth.type", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
  spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
  spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")

connectLake()
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net/"

Using Python, you can register a table with:

spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"/PRESENTED/DIMDATE/V1'")

You can now query that table if you have executed the connectLake() function - which is fine in your current session/notebook.

The problem is that if a new session comes in and tries to select * from that table, it will fail unless it runs the connectLake() function first. There is no way around that limitation, as you have to provide credentials to access the lake.
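
In other words, each new session has to re-run something like this before the table becomes queryable (assuming connectLake() is defined as above):

connectLake()  # re-establish the OAuth configuration for this session
spark.sql("select * from DimDate").show()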

You may want to consider ADLS Gen2 credential passthrough: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html

Note that this requires using a High Concurrency cluster.

simon_dmorias
  • Does not work for ADLS Gen2. This method works for Gen1 storage accounts. – anurag Jul 08 '19 at 09:04
  • I tested the above on Gen2, have you? If so, please describe the error. – simon_dmorias Jul 08 '19 at 10:23
  • Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.); – anurag Jul 08 '19 at 15:09
  • That session has not run the connectLake() function. As explained, every session needs to run it. – simon_dmorias Jul 08 '19 at 15:27
  • Do we need to run the connectLake() function even if we run the DDL in the same Spark session? – anurag Jul 11 '19 at 10:14