
I use PySpark 2.1 to compute table and column statistics on Hive tables.

When I do:

spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS")

I am able to collect the stats with a DESCRIBE EXTENDED SQL command:

spark.sql("DESCRIBE EXTENDED <table name>").show()

However, when computing column statistics like so:

spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS FOR COLUMNS")

The job is submitted and completes successfully, but I am unable to collect the stats with the SQL command advised by this answer:

spark.sql("DESCRIBE EXTENDED <table name> <column name>").show()

I get:

ParseException Unsupported SQL statement

Reissuing the same DESCRIBE EXTENDED query (without the column name) does not show any change in the CatalogTable: I can still only see the table-level statistics (i.e. sizeInBytes and rowCount).

This other answer suggests retrieving Spark statistics from a "table" in the metastore, but this is cryptic to me...
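In case it clarifies what I am after: my understanding is that the "table" in question is one of the relational tables backing the Hive metastore itself (e.g. TABLE_PARAMS or TAB_COL_STATS). Below is a hedged sketch of how I imagine that would be read; the JDBC URL, credentials and driver are placeholders for my environment, and I have not verified the exact property keys Spark writes.

    # Hypothetical sketch: reading raw statistics straight from the database
    # backing the Hive metastore. URL, user, password and driver are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    metastore_url = "jdbc:mysql://metastore-host:3306/hive"  # assumption: MySQL-backed metastore
    props = {"user": "hive", "password": "<password>", "driver": "com.mysql.jdbc.Driver"}

    # Table-level properties (Hive's numRows/totalSize and, I believe, the
    # spark.sql.statistics.* keys written by Spark) live in TABLE_PARAMS.
    table_params = spark.read.jdbc(metastore_url, "TABLE_PARAMS", properties=props)
    table_params.filter("PARAM_KEY LIKE '%STATS%' OR PARAM_KEY LIKE 'spark.sql.statistics%'") \
                .show(truncate=False)

    # Hive's own column-level statistics live in TAB_COL_STATS.
    spark.read.jdbc(metastore_url, "TAB_COL_STATS", properties=props).show(truncate=False)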

How can I access these column-level statistics within Spark?


Edit: I have investigated this further, which allows me to refine the scope of my question:

It looks like my Hive client (Hive View 2.0) and Spark SQL do not write the statistics in the same location.

  • When using Spark SQL's ANALYZE TABLE command, only the table statistics show up in the Statistics key of the CatalogTable:

     Statistics: sizeInBytes=15196959528, rowCount=39763901, isBroadcastable=false
    

However, Hive View is blind to these stats, which do not appear in its table statistics.

  • Conversely, when computing table or column statistics within Hive View, I can collect table statistics with Spark SQL's DESCRIBE EXTENDED command, but they appear in the Properties key of my CatalogTable. It also indicates whether some column statistics have been computed (see the quick check after this list):

    Properties: [numFiles=20, transient_lastDdlTime=1540381765, totalSize=15196959528, COLUMN_STATS_ACCURATE={"COLUMN_STATS":{"bubble_level":"true","bubble_level_n1s":"true","timestamp":"true"}}]
    
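This is the quick check I was referring to above: since the Hive-written numbers are plain table properties, they can also be listed directly without parsing the DESCRIBE EXTENDED output (placeholder table name, same Hive-enabled session as before).

    # Placeholder table name; assumes a Hive-enabled SparkSession as before.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # The Hive-written values (numFiles, totalSize, COLUMN_STATS_ACCURATE, ...)
    # surface as table properties and can be listed directly.
    spark.sql("SHOW TBLPROPERTIES my_db.my_table").show(truncate=False)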

These two sets of statistics therefore appear to be independent, and my question becomes: which of them can Spark use to optimize the execution plan? I understand that some of this could be solved by upgrading to the latest version of Spark, but that is not on my schedule for the moment.
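Pending a definitive answer, the empirical test I plan to run is to see whether the size statistics actually change the join strategy. This is only a sketch of that idea, with placeholder table and column names; my assumption is that a table whose sizeInBytes is below spark.sql.autoBroadcastJoinThreshold should be broadcast if the statistics are picked up.

    # Sketch of an empirical check -- table and column names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # 10 MB is the default broadcast threshold; set it explicitly for reproducibility.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

    big = spark.table("my_db.big_table")
    small = spark.table("my_db.small_table")  # analyzed table, sizeInBytes < threshold

    # If the statistics are used, the physical plan should contain a
    # BroadcastHashJoin rather than a SortMergeJoin.
    big.join(small, "join_key").explain()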

The above-mentioned answer by Jacek Laskowski suggests that Hive's statistics can be used by Spark as long as they are available through the Spark SQL DESCRIBE EXTENDED command.

Can anybody confirm this?

Many thanks in advance for helping me to clear this up.

Benjamin
  • I would suggest you use the DataFrame API, e.g., `df = spark.read.table("")`, then `df.select("").describe()`.
    – serge_k Oct 24 '18 at 05:24
  • Serge, I am using the HiveQL API because I don't just want to retrieve the stats: I want to store them in the metastore and benchmark the effect of pre-computing Hive statistics on my Spark jobs. Basically, I intend to empirically address this [unsettled question](https://stackoverflow.com/q/46950007/4696032). – Benjamin Oct 24 '18 at 08:30

0 Answers