
Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. We can then easily use Athena, Redshift, or EMR to query the data on S3 with Glue as the metastore.

My question is: is it possible to expose the Glue Data Catalog as a metastore for external services like Databricks hosted on AWS?

Obaid

2 Answers


Databricks now provides documentation for using the Glue Data Catalog as the metastore (a rough sketch of the key pieces follows the reference link below). The setup is done in these steps:

  1. Create an IAM role and policy to access a Glue Data Catalog
  2. Create a policy for the target Glue Catalog
  3. Look up the IAM role used to create the Databricks deployment
  4. Add the Glue Catalog IAM role to the EC2 policy
  5. Add the Glue Catalog IAM role to a Databricks workspace
  6. Launch a cluster with the Glue Catalog IAM role

Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html.
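As a rough illustration of steps 1–2, here is a minimal sketch using boto3 (the AWS SDK for Python). The policy name, account ID, region, and the exact set of Glue actions are placeholders for illustration; the linked Databricks docs give the authoritative policy to use.

```python
# Sketch of steps 1-2: create an IAM policy granting access to the Glue
# Data Catalog. Policy name, account ID, and region are placeholders.
import json
import boto3

iam = boto3.client("iam")

glue_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GlueCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition",
                "glue:CreateTable",
                "glue:UpdateTable",
            ],
            # Placeholder account/region; scope this to your Glue catalog.
            "Resource": "arn:aws:glue:us-east-1:123456789012:*",
        }
    ],
}

iam.create_policy(
    PolicyName="databricks-glue-catalog-access",  # hypothetical name
    PolicyDocument=json.dumps(glue_access_policy),
)
```

Once the role is attached to the cluster (steps 3–6), the cluster's Spark configuration needs `spark.databricks.hive.metastore.glueCatalog.enabled true`, as described in the linked docs.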

matiasm

There have been a couple of decent documentation/write-up pieces provided by Databricks (see the docs and the blog post), though they cover custom/legacy Hive metastore integration, not Glue itself.

Also, as a Plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do a one-way replication to Glue through the Java SDK (or perhaps the other way around, mapping AWS API responses to sequences of CREATE TABLE / ADD PARTITION statements). Of course, this is riddled with rather complex corner cases, like cascading partition/table deletions, but for simple create-only stuff it seems approachable at least; see the sketch below.
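For illustration, here is a minimal create-only sketch of that one-way replication, written with boto3 (the Python counterpart of the Java SDK mentioned above). The database name, table schema, and S3 location are invented placeholders, and how you extract the definition from the Databricks metastore (e.g. parsing `DESCRIBE FORMATTED` output) is left out.

```python
# Hypothetical one-way replication: push one table definition into Glue.
# All names, columns, and locations below are illustrative placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# A table definition as it might be read out of the Databricks metastore,
# e.g. by parsing the output of `DESCRIBE FORMATTED my_db.events`.
table_def = {
    "Name": "events",
    "TableType": "EXTERNAL_TABLE",
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "user_id", "Type": "bigint"},
            {"Name": "payload", "Type": "string"},
        ],
        "Location": "s3://my-bucket/warehouse/events/",  # placeholder path
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}

# Create-only replication: no handling of updates or cascading deletions,
# which is exactly the class of corner cases called out above. Ignores the
# already-exists error the create_database call raises on a re-run.
glue.create_database(DatabaseInput={"Name": "my_db"})
glue.create_table(DatabaseName="my_db", TableInput=table_def)
```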

Anton Kraievyi