2

How can I create an EXTERNAL TABLE in Azure Databricks which reads from Azure Data Lake Store? I am having trouble seeing in the documentation if it is even possible. I have a set of CSV files in a specific folder in Azure Data lake Store, and I want to do a CREATE EXTERNAL TABLE in Azure Databricks which points to the CSV files.

Hauke Mallow
datarocker

2 Answers

3

1. Reference mounted directories

You can mount Azure Data Lake Store (ADLS) to Azure Databricks DBFS (requires Databricks Runtime 4.0 or higher):

    # Get Azure Data Lake Store credentials from the secret store
    clientid = dbutils.preview.secret.get(scope = "adls", key = "clientid")
    credential = dbutils.preview.secret.get(scope = "adls", key = "credential")
    refreshurl = dbutils.preview.secret.get(scope = "adls", key = "refreshurl")
    accounturl = dbutils.preview.secret.get(scope = "adls", key = "accounturl")

    # Mount the ADLS
    configs = {"dfs.adls.oauth2.access.token.provider.type": "ClientCredential",
       "dfs.adls.oauth2.client.id": clientid,
       "dfs.adls.oauth2.credential": credential,
       "dfs.adls.oauth2.refresh.url": refreshurl}

    dbutils.fs.mount(
       source = accounturl,
       mount_point = "/mnt/adls",
       extra_configs = configs)
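
A quick sanity check of the mount could look like the snippet below; the productscsv folder is just the example folder used further down:

    # List the mounted directory and read the CSV files through the mount point
    display(dbutils.fs.ls("/mnt/adls/productscsv"))

    df = spark.read.csv("/mnt/adls/productscsv/", header=True, inferSchema=True)
    display(df)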

Table creation works the same way as with DBFS. Just reference the mount point with the directory in ADLS, e.g.:

    %sql 
    CREATE TABLE product
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION "/mnt/adls/productscsv/"

The LOCATION clause automatically implies an EXTERNAL table. See also the Azure Databricks documentation.
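
If you want to double-check that the table really came out as external (using the product table from the example above), something like this should work from a Python cell:

    # "Type" should show EXTERNAL and "Location" should point to the mounted directory
    spark.sql("DESCRIBE TABLE EXTENDED product").show(truncate=False)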

2. Reference the Data Lake Store in the table definition directly

You can also reference the storage directly without mounting it. This scenario makes sense if the metadata or parts of the code are also used on other platforms. In that case access to the storage has to be defined at the cluster or notebook level (see this Databricks documentation for ADLS Gen1 or this documentation for Gen2 configuration details), or Azure AD credential passthrough is used. The table definition would look like this for ADLS Gen1:

    CREATE TABLE sampletable
    (L_ORDERKEY BIGINT,
     L_PARTKEY BIGINT,
     L_SUPPKEY BIGINT,
     L_SHIPMODE STRING,
     L_COMMENT STRING)
    USING csv
    OPTIONS ('DELIMITER' '|')
    LOCATION "adl://<your adls>.azuredatalakestore.net/directory1/sampletable";

For Azure Data Lake Storage Gen2, the location reference looks like:

 LOCATION "abfss://<file_system>@<account_name.dfs.core.windows.net/directory/tablename"
Hauke Mallow
  • Is there a solution without mounting the ADLS to DBFS? – anurag Jun 27 '19 at 16:32
  • Thanks for the info. I tried doing the same for ADLS Gen2 using service principal credentials, but it seems to work only for Gen1 ADLS. https://stackoverflow.com/questions/56792095/create-external-table-in-azure-databricks – anurag Jun 27 '19 at 19:02
  • Have you used something like this for the location in ADLS Gen2: abfss://<file_system>@<account_name>.dfs.core.windows.net/dir/sampletable? – Hauke Mallow Jun 27 '19 at 19:07
  • Yes, the same is mentioned in the link in my comment above. – anurag Jun 27 '19 at 20:36
0

You should consider looking at this link: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/azure-datalake.html

Access Azure Data Lake Store using the Spark API

To read from your Data Lake Store account, you can configure Spark to use service credentials with the following snippet in your notebook:

spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential") spark.conf.set("dfs.adls.oauth2.client.id", "{YOUR SERVICE CLIENT ID}") spark.conf.set("dfs.adls.oauth2.credential", "{YOUR SERVICE CREDENTIALS}") spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/{YOUR DIRECTORY ID}/oauth2/token")

It doesn't mention the use of External Table.

  • Look at the question first! This method is posted on thousands of blogs and forums. – anurag Jun 27 '19 at 20:38
  • Feel free to contribute by adding an answer to this question if the above is not suitable for your case. Note that this answer is more than a year old and could have become irrelevant. The goal of Stack Overflow is to share info and guidance. If you have good guidance or an answer, please post it to help others. – Adam Smith - Microsoft Azure Jun 27 '19 at 22:36