0

I need to get all the table columns in one go for a particular database in Azure Databricks. I know how to do this in SQL Server with the following query; I need the same kind of implementation in Databricks as well.

select schema_name(tab.schema_id) as schema_name,
    tab.name as table_name, 
    col.column_id,
    col.name as column_name, 
    t.name as data_type,    
    col.max_length,
    col.precision
from sys.tables as tab
    inner join sys.columns as col
        on tab.object_id = col.object_id
    left join sys.types as t
    on col.user_type_id = t.user_type_id
order by schema_name,
    table_name, 
    column_id;
Satya

2 Answers

3

Actually there are many ways to achieve this; one is to use some basic Python with DataFrames, a list, and a loop.

Let's play....

First you have to retrieve all the table names, and then for each of those tables retrieve its description, which contains every column name with its data type. We use two Spark SQL queries:

1: Table_name = spark.sql("SHOW TABLES FROM default")

(default is the name of the default database in Databricks)

result

+--------+--------------+-----------+
|database|     tableName|isTemporary|
+--------+--------------+-----------+
| default|convertedtable|      false|
| default|   delta_table|      false|
| default|       display|      false|
| default|        events|      false|
| default|    export_csv|      false|
| default|       flights|      false|
| default|    play_store|      false|
+--------+--------------+-----------+

2: table_describe = spark.sql("DESCRIBE default.play_store")

result

+--------------+---------+-------+
|      col_name|data_type|comment|
+--------------+---------+-------+
|           App|   string|   null|
|      Category|   string|   null|
|        Rating|   string|   null|
|       Reviews|   string|   null|
|          Size|   string|   null|
|      Installs|   string|   null|
|          Type|   string|   null|
|         Price|   string|   null|
|Content Rating|   string|   null|
|        Genres|   string|   null|
|  Last Updated|   string|   null|
|   Current Ver|   string|   null|
|   Android Ver|   string|   null|
+--------------+---------+-------+

Now the main part

from pyspark.sql.functions import lit

# collect every table name in the default database
Table_name = spark.sql("SHOW TABLES FROM default")
table_names = [row.tableName for row in Table_name.select('tableName').collect()]

temp_df = None
for name in table_names:
  # describe each table and tag every row with the table it came from
  tmp = spark.sql('DESCRIBE default.' + name).withColumn('table_name', lit(name))
  temp_df = tmp if temp_df is None else temp_df.union(tmp)

temp_df.show()

TADAAAA.....

+-----------------+---------+-------+--------------+
|         col_name|data_type|comment|    table_name|
+-----------------+---------+-------+--------------+
|             Year|      int|   null|convertedtable|
|            Month|      int|   null|convertedtable|
|       DayofMonth|      int|   null|convertedtable|
|        DayOfWeek|      int|   null|convertedtable|
|          DepTime|   string|   null|convertedtable|
|       CRSDepTime|      int|   null|convertedtable|
|          ArrTime|   string|   null|convertedtable|
|       CRSArrTime|      int|   null|convertedtable|
|    UniqueCarrier|   string|   null|convertedtable|
|        FlightNum|      int|   null|convertedtable|
|          TailNum|   string|   null|convertedtable|
|ActualElapsedTime|   string|   null|convertedtable|
|   CRSElapsedTime|   string|   null|convertedtable|
|          AirTime|   string|   null|convertedtable|
|         ArrDelay|   string|   null|convertedtable|
|         DepDelay|   string|   null|convertedtable|
|           Origin|   string|   null|convertedtable|
|             Dest|   string|   null|convertedtable|
|         Distance|      int|   null|convertedtable|
|           TaxiIn|   string|   null|convertedtable|
+-----------------+---------+-------+--------------+
only showing top 20 rows

Take whatever you need from the new DataFrame.
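For example, here is a minimal sketch (assuming the temp_df built above) that pulls just the column names and data types of a single table:

# a minimal sketch, assuming temp_df from the loop above;
# keep only the rows describing the play_store table
play_store_cols = (temp_df
                   .filter(temp_df.table_name == "play_store")
                   .select("col_name", "data_type"))
play_store_cols.show(truncate=False)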

Hope it fulfills what you want.

If you find this to be your solution, don't forget to mark it as the answer and upvote.

Sohel Reza
  • Hi, here we are getting only the top 20 rows. How do we get all the rows? Please can you tell. – Satya Feb 21 '20 at 11:26
  • Actually you are not getting only 20 rows; the DataFrame is just showing the top 20 rows. Underneath, you have all the data. – Sohel Reza Feb 22 '20 at 12:11
  • How can I download all the rows as an Excel file? Because I want to see all the rows. – Satya Feb 22 '20 at 13:44
  • 1
    Thank you so much. I got all the columns. I used display(temp_df); it directly shows the download option (see the export sketch after these comments). – Satya Feb 22 '20 at 14:30
  • Missing: from pyspark.sql.functions import lit – PiC Oct 26 '21 at 16:42
  • "withColumn" gives an error, can you help to resolve? Resolved attribute(s) col_name#275276,data_type#275277,comment#275278 missing from col_name#275193,data_type#275194,comment#275195 in operator !Project [col_name#275276, data_type#275277, comment#275278, open_order_fct AS table_name#275279]. Attribute(s) with the same name appear in the operation: col_name,data_type,comment. Please check if the right attribute(s) are used.; – PiC Oct 26 '21 at 16:43
  • If I remove "withColumn" then I get an error that "alltables" does not exist. Can you help resolve it? – PiC Oct 26 '21 at 16:45
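On the export question raised in these comments: a minimal sketch, assuming the temp_df DataFrame from the answer above is available. In a Databricks notebook, display() renders the result with a built-in download option; writing it out as a single CSV file is an alternative (the output path is only an illustration).

# assumes temp_df from the answer above
display(temp_df)  # the Databricks notebook UI offers a download/export option

# alternative: write everything out as one CSV file
(temp_df
 .coalesce(1)                     # gather the rows into a single output file
 .write
 .mode("overwrite")
 .option("header", "true")
 .csv("/tmp/all_table_columns"))  # hypothetical output path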
2

The answer by Sohel Reza is good, but it doesn't cover the case where you want to get nested columns (StructType). It only shows the highest-level column name.

If you want a simple view which lists all the databases, tables, and columns, I've written about one such solution here: https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10

It creates a view with database, table name and column name.

You can also copy/paste the full code below:

from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
  if isinstance(field.dataType, StructType):
    if parent is None:
      prt = field.name
    else:
      prt = parent+"."+field.name # using dot notation
    res = []
    for i in field.dataType.fields:
      res.append(get_schema_field_name(i, prt))
    return res
  else:
    if parent is None:
      res = field.name
    else:
      res = parent+"."+field.name
    return res
  
# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
  if S == []:
    return S
  if isinstance(S[0], list):
    return flatten(S[0]) + flatten(S[1:])
  return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]

for i in db_list:
  spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i)+"TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
  if idx == 0:
    union_string += str(item)+"TablesList WHERE isTemporary = 'false'"
  else:
    union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item)+"TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
  table_name = i[0]+"."+i[1]
  table_schema = spark.sql("SELECT * FROM {}".format(table_name))
  column_list = []
  for j in table_schema.schema:
    column_list.append(get_schema_field_name(j))
  column_list = flatten(column_list)
  for k in column_list:
    full_list.append([i[0],i[1],k])
spark.createDataFrame(full_list, schema = ['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
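Once the code above has run, you can query the allColumns view like any other table; nested StructType fields appear in dot notation (for example parent.child). A minimal usage sketch follows (the LIKE filter is only an illustration):

# assumes the allColumns view created above exists in this session;
# find every table that has a column whose name contains 'id'
spark.sql("""
  SELECT database, tableName, columnName
  FROM allColumns
  WHERE columnName LIKE '%id%'
  ORDER BY database, tableName
""").show(truncate=False)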
Kristo_R