3

The following can be used to show tables in the current schema or in a specified schema, respectively:

show tables;

show tables in my_schema;

This is documented here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/show-tables.html

Is there a way to show all tables in all databases?

Are there metadata tables in Databricks/Spark (similar to the all_ or dba_ tables in Oracle, or the information_schema in MySQL)? Is there a way to run more specific queries about database objects in Databricks? Something like this:

select * from i_dont_know_what where lower(table_name) like '%gold%' and schema = 'myschema';
John
  • Does this answer your question? [How to see all the databases and Tables in Databricks](https://stackoverflow.com/questions/64013666/how-to-see-all-the-databases-and-tables-in-databricks) – jdhao May 25 '23 at 14:13

6 Answers

8

I tried different ways of obtaining all tables in all schemas, but the following is the fastest and most efficient way I found in Python:

databases = [
    db.databaseName 
    for db in spark.sql('show databases').collect()
]
tables = [
    f"{row['database']}.{row['tableName']}" #<schema>.<table> format
    for db_rows in [
        spark.sql(f'show tables in {db}').collect() for db in databases
    ] 
    for row in db_rows
]
print(tables)

I tested it with around 160 schemas containing a total of around 3,300 tables, and it took around 7 seconds on an r5n.xlarge cluster.
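
If you then want to filter this list with SQL the way the question describes, a minimal follow-up sketch (reusing the tables list built above; the view name all_tables and the column fullTableName are just placeholders I picked) could look like this:

from pyspark.sql import Row

# load the <schema>.<table> strings into a DataFrame so they can be queried with SQL
tables_df = spark.createDataFrame([Row(fullTableName=t) for t in tables])
tables_df.createOrReplaceTempView("all_tables")
spark.sql("SELECT * FROM all_tables WHERE lower(fullTableName) LIKE '%gold%'").show()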

4

Can't you use the Spark catalog APIs on Databricks? Please try this:

// collect a (database, table) pair for every table in every database
// (a Seq keeps all pairs; a Map keyed by database name would keep only one table per database)
val tuples: Seq[(String, String)] = spark.catalog.listDatabases().collect().toSeq.flatMap(db =>
  spark.catalog.listTables(db.name).collect().map(x => (db.name, x.name))
)
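
For comparison, a rough PySpark sketch of the same catalog-API idea (not part of the original answer; it only assumes the standard spark.catalog.listDatabases/listTables calls):

# collect a (database, table) pair for every table via the PySpark catalog API
all_tables = [
    (db.name, tbl.name)
    for db in spark.catalog.listDatabases()
    for tbl in spark.catalog.listTables(db.name)
]
print(all_tables)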
Som
4

I had a similar issue. I wrote a short article about it as well: https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10

The output is a Spark SQL view which holds the database name, table name, and column name, for all databases, all tables, and all columns. You could extend it with more information. A nice part, which I needed, is that it also lists nested columns (StructType).


Edit 2022-02-21: added columns for column dataType and nullable values.

PySpark code:

from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
  if type(field.dataType) == StructType:
    if parent == None:
      prt = field.name
    else:
      prt = parent+"."+field.name # using dot notation
    res = []
    for i in field.dataType.fields:
      res.append(get_schema_field_name(i, prt))
    return res
  else:
    if parent==None:
      res = {"name": field.name, "dataType": str(field.dataType), "nullable": field.nullable}
    else:
      res = {"name": parent+"."+field.name, "dataType": str(field.dataType), "nullable": field.nullable}
    return res
  
# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
  if S == []:
    return S
  if isinstance(S[0], list):
    return flatten(S[0]) + flatten(S[1:])
  return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]

for i in db_list:
  spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i)+"TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
  if idx == 0:
    union_string += str(item)+"TablesList WHERE isTemporary = 'false'"
  else:
    union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item)+"TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
  table_name = i[0]+"."+i[1]
  table_schema = spark.table(table_name)
  column_list = []
  for j in table_schema.schema:
    column_list.append(get_schema_field_name(j))
  column_list = flatten(column_list)
  for k in column_list:
    full_list.append([i[0],i[1],k["name"], k["dataType"], k["nullable"]])
spark.createDataFrame(full_list, schema = ['database', 'tableName', 'columnName', "dataType", "nullable"]).createOrReplaceTempView("allColumns")
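
Once the allColumns view exists, it can be queried much like the metadata table the question asks for, e.g. (a usage sketch based on the columns defined above):

# example: find columns of tables whose name contains 'gold' in a given database
spark.sql("""
    SELECT database, tableName, columnName, dataType
    FROM allColumns
    WHERE lower(tableName) LIKE '%gold%' AND database = 'myschema'
""").show()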
Kristo_R
  • thanks for sharing.. can you enhance this code to also include datatype in output? – Learn2Code Feb 15 '22 at 05:06
  • @Learn2Code sure! Have a look, I updated the code. It now includes dataType and nullable fields. Based on a quick test it looks a bit clumsy for ArrayType, but otherwise seems OK. – Kristo_R Feb 21 '22 at 16:44
3

You can use the code below to list all table names in one database:

df = spark.sql("show tables in {}".format("<Your Database Name>"))
display(df)
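
If you also want to filter by table name (as the comment below asks), a small sketch building on the df above:

# filter the SHOW TABLES result, similar to "table_name like 'foo%'" in Oracle/MySQL
df.filter("tableName like 'foo%'").show()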
Ardalan Shahgholi
  • This works to show all of the tables, but I would also like to be able to query in a manner similar to Oracle or MySQL (e.g. select * from all_tables where table_name like 'foo%'). – John Nov 06 '20 at 01:03
2

This code will create an information_schema table for you in the hive_metastore:

cls = []
spark.sql("Drop view if exists allColumns")

for db in spark.sql("show databases").collect():
    for table in spark.catalog.listTables(f"{db.databaseName}"):
        for column in spark.catalog.listColumns(table.name, table.database):
            cls.append([table.database,table.name, column.name, column.dataType])
            
spark.createDataFrame(cls, schema = ['databaseName','tableName','columnName', 
'columnDataType']).createOrReplaceTempView("allColumns")

spark.sql("""create or replace table default.information_schema as select * from allColumns""")
2

The code below gives extended information about Databricks (Delta) tables, such as location, createdAt, sizeInBytes, etc.

# DB Names
databases = [
    db.databaseName 
    for db in spark.sql('show databases').collect()
]
# Table Names
tables = [
    f"{row['database']}.{row['tableName']}"
    for db_rows in [
        spark.sql(f'show tables in {db}').collect() for db in databases
    ] 
    for row in db_rows
]
# Table details
table_details = [
    {
        'id': row['id'],
        'name': row['name'],
        'description': row['description'],
        'location': row['location'],
        'createdAt': row['createdAt'],
        'lastModified': row['lastModified'],
        'partitionColumns': row['partitionColumns'],
        'numFiles': row['numFiles'],
        'sizeInBytes': row['sizeInBytes'],
        'minReaderVersion': row['minReaderVersion'],
        'minWriterVersion': row['minWriterVersion']
    }
    for tbl_rows in [
        spark.sql(f'describe detail {tbl}').collect() for tbl in tables
    ]
    for row in tbl_rows
]
#Print extended table details
print(table_details)

#A Spark DF containing table details
df_all_tbl_details = spark.createDataFrame(table_details)
df_all_tbl_details.show()
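
As a follow-up sketch (not part of the original answer), the DataFrame can be registered as a view and queried, for example to list the largest tables first. Note that DESCRIBE DETAIL is intended for Delta tables, so non-Delta tables in the list may cause the comprehension above to fail.

# register the details and query them, e.g. largest tables first
df_all_tbl_details.createOrReplaceTempView("all_table_details")
spark.sql("""
    SELECT name, location, sizeInBytes
    FROM all_table_details
    ORDER BY sizeInBytes DESC
    LIMIT 10
""").show(truncate=False)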