I had a similar issue and wrote a short article about it: https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
The output is a Spark SQL view that holds the database name, table name, and column name for all databases, tables, and columns.
You could extend it to include more information. A nice part, which I needed, is that it also lists nested columns (StructType).
Edit 2022-02-21: added dataType and nullable columns to the output.
PySpark code:
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if isinstance(field.dataType, StructType):
        if parent is None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent is None:
            res = {"name": field.name, "dataType": str(field.dataType), "nullable": field.nullable}
        else:
            res = {"name": parent + "." + field.name, "dataType": str(field.dataType), "nullable": field.nullable}
        return res
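
# illustrative example (hypothetical schema, not from the original answer): for a field
#   address: struct<city: string, zip: string>
# get_schema_field_name returns a list of dicts like
#   [{"name": "address.city", ...}, {"name": "address.zip", ...}]
# deeper structs produce nested lists, hence the flatten helper below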

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])
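
# e.g. flatten([{"name": "a"}, [{"name": "b.c"}, {"name": "b.d"}]])
# returns [{"name": "a"}, {"name": "b.c"}, {"name": "b.d"}]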

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")
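# illustration: with two hypothetical databases "sales" and "hr", the assembled query is
#   SELECT database, tableName FROM salesTablesList WHERE isTemporary = 'false'
#   UNION ALL SELECT database, tableName FROM hrTablesList WHERE isTemporary = 'false'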

# full list = database, table, column (plus dataType and nullable)
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]  # i[0] = database, i[1] = tableName
    table_schema = spark.table(table_name)
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k["name"], k["dataType"], k["nullable"]])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName', 'dataType', 'nullable']).createOrReplaceTempView("allColumns")
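
Once the allColumns view exists you can query it like any other view. A minimal usage sketch (the LIKE pattern is just an illustration, not part of the original code):

# find every column whose name contains "id", across all databases and tables
spark.sql("SELECT * FROM allColumns WHERE columnName LIKE '%id%'").show(truncate=False)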