3

Important metadata is registered in our tables via column comments:

create table t1 (
  column1 type COMMENT '... the comments1...',
  column2 type COMMENT '... the comments2...'
  ...
);

Our database has hundreds of tables, and hundreds of column COMMENTs in each table.

We need a Spark or HQL statement to obtain the SQL column comments.

How can we do serious information retrieval on these comments? I need them as a first-class table,... or, if that is impossible, as a JSON or standard MIME CSV file with all the metadata.


NOTES:

  • We use both direct Hive SQL and Spark, so either kind of answer (HQL or a Scala/Spark script) is valid.

  • We don't want to scrape the output of SHOW/DESCRIBE TABLES commands; we need a robust solution.

  • This question is similar, but restricted to MySQL... The only clue there is some direct access to the Metastore (in our configuration it seems to be a Derby SQL schema, which is non-standard).

  • This other question is also about "serious information retrieval" of metadata.

Peter Krauss

2 Answers

4

You can retrieve the comments through the DataFrame schema using StructField.getComment().

For example, if you want to manipulate the comments as a DataFrame:

// create a demo table
spark.sql("create table t1 (id long comment 'this is an example comment')")

// load table as dataframe
val t1 = spark.table("t1")

// load column comments as a DataFrame (needs spark.implicits._ in scope, as in spark-shell)
val commentsDf1 = t1.schema.map(f => (f.name, f.getComment)).toDF("name", "comment")

// check that everything is loaded correctly
commentsDf1.show(false)
+----+--------------------------+
|name|comment                   |
+----+--------------------------+
|id  |this is an example comment|
+----+--------------------------+

If you want to use SQL to access these comments, you can use DESCRIBE <tablename> to achieve the same result:

val commentsDf2 = spark.sql("describe t1")

commentsDf2.show(false)
+--------+---------+--------------------------+
|col_name|data_type|comment                   |
+--------+---------+--------------------------+
|id      |bigint   |this is an example comment|
+--------+---------+--------------------------+

Both commentsDf1 and commentsDf2 are first-class DataFrames: commentsDf2 is plain tabular output, while commentsDf1 is a little richer and more flexible.

Peter Krauss
rluta
  • Thanks! Seems an undocumented Spark feature, [getComment](https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/StructField.html#getComment--)... Do you suggest a link or tutorial about advanced uses of the *.schema* and schema utils? – Peter Krauss Sep 09 '19 at 21:34
  • I don't know if any exists actually. – rluta Sep 10 '19 at 08:04
  • It seems that there is no "getComment" option for Python, is that correct? – Siete Nov 09 '22 at 09:14
  • Correct, but you can access (and modify) the comment using the metadata property of StructField: `StructField.metadata.get('comment')` in PySpark is equivalent to Scala's `StructField.getComment()` – rluta Nov 09 '22 at 10:12
2

If you use PySpark, you can do the following:

for col_schema in df.schema:
    print(col_schema.metadata.get('comment'))  # .get avoids a KeyError for columns without a comment
Siete