
Let's say the data lake is on AWS, using S3 as storage and Glue as the data catalog. We can then easily use Athena, Redshift, or EMR to query the data on S3 with Glue as the metastore.

My question is: is it possible to expose the Glue Data Catalog as a metastore for external services like Databricks hosted on AWS?

Obaid

2 Answers


Databricks now provides documentation for using the Glue Data Catalog as the metastore (a rough sketch of the key pieces follows the reference link below). The setup is done in these steps:

  1. Create an IAM role and policy to access a Glue Data Catalog
  2. Create a policy for the target Glue Catalog
  3. Look up the IAM role used to create the Databricks deployment
  4. Add the Glue Catalog IAM role to the EC2 policy
  5. Add the Glue Catalog IAM role to a Databricks workspace
  6. Launch a cluster with the Glue Catalog IAM role

Reference: https://docs.databricks.com/data/metastores/aws-glue-metastore.html.
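As a rough illustration of steps 1–2, here is a minimal sketch using boto3 (the AWS SDK for Python). The policy name, account ID, region, and the exact set of Glue actions are placeholders for illustration; the linked Databricks docs give the authoritative policy to use.

```python
# Sketch of steps 1-2: create an IAM policy granting access to the Glue
# Data Catalog. Policy name, account ID, and region are placeholders.
import json
import boto3

iam = boto3.client("iam")

glue_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GlueCatalogAccess",
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
                "glue:BatchGetPartition",
                "glue:CreateTable",
                "glue:UpdateTable",
            ],
            # Placeholder account/region; scope this to your Glue catalog.
            "Resource": "arn:aws:glue:us-east-1:123456789012:*",
        }
    ],
}

iam.create_policy(
    PolicyName="databricks-glue-catalog-access",  # hypothetical name
    PolicyDocument=json.dumps(glue_access_policy),
)
```

Once the role is attached to the cluster (steps 3–6), the cluster's Spark configuration needs `spark.databricks.hive.metastore.glueCatalog.enabled true`, as described in the linked docs.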

matiasm

There have been a couple of decent documentation/write-up pieces provided by Databricks (see the docs and the blog post), though they cover custom/legacy Hive metastore integration, not Glue itself.

Also, as a Plan B, it should be possible to inspect the table/partition definitions you have in the Databricks metastore and do a one-way replication to Glue through the Java SDK (or perhaps the other way around, mapping AWS API responses to sequences of CREATE TABLE / ADD PARTITION statements). Of course, this is riddled with rather complex corner cases, like cascading partition/table deletions, but for simple create-only stuff it seems approachable at least; see the sketch below.
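For illustration, here is a minimal create-only sketch of that one-way replication, written with boto3 (the Python counterpart of the Java SDK mentioned above). The database name, table schema, and S3 location are invented placeholders, and how you extract the definition from the Databricks metastore (e.g. parsing `DESCRIBE FORMATTED` output) is left out.

```python
# Hypothetical one-way replication: push one table definition into Glue.
# All names, columns, and locations below are illustrative placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

# A table definition as it might be read out of the Databricks metastore,
# e.g. by parsing the output of `DESCRIBE FORMATTED my_db.events`.
table_def = {
    "Name": "events",
    "TableType": "EXTERNAL_TABLE",
    "PartitionKeys": [{"Name": "dt", "Type": "string"}],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "user_id", "Type": "bigint"},
            {"Name": "payload", "Type": "string"},
        ],
        "Location": "s3://my-bucket/warehouse/events/",  # placeholder path
        "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
        },
    },
}

# Create-only replication: no handling of updates or cascading deletions,
# which is exactly the class of corner cases called out above. Ignores the
# already-exists error the create_database call raises on a re-run.
glue.create_database(DatabaseInput={"Name": "my_db"})
glue.create_table(DatabaseName="my_db", TableInput=table_def)
```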

Anton Kraievyi