I use PySpark 2.1 to compute table and column statistics on Hive tables.
When I do:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS")
I am able to collect the stats with a DESCRIBE EXTENDED SQL command:
spark.sql("DESCRIBE EXTENDED <table name>").show()
However, when computing column statistics like so:
spark.sql("ANALYZE TABLE <table name> COMPUTES STATISTICS FOR COLUMNS")
The job is submitted and completes successfully, but I am unable to collect the stats using the SQL command advised by this answer:
spark.sql("DESCRIBE EXTENDED <table name> <column name>").show()
I get:
ParseException Unsupported SQL statement
Reissuing the same DESCRIBE EXTENDED query (without the column name) shows no change in the CatalogTable: I can only see the table statistics (i.e. sizeInBytes and rowCount).
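In case it helps, here is a minimal sketch of how I dump the untruncated DESCRIBE EXTENDED output so that the full CatalogTable row is visible (<table name> is a placeholder):

# Collect instead of .show(), which truncates the long
# "# Detailed Table Information" row that contains the Statistics key.
for row in spark.sql("DESCRIBE EXTENDED <table name>").collect():
    print(row)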
This other answer suggests retrieving Spark statistics from a "table" in the metastore, but this is cryptic to me...
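My best guess at what that answer means, as a hedged sketch: Hive keeps column statistics in the TAB_COL_STATS table of the RDBMS backing the metastore, which could be read over JDBC. The URL, driver, and credentials below are placeholders (I am assuming a MySQL-backed metastore, and the schema may differ across Hive versions):

# Read Hive's column-statistics table straight from the metastore database.
# All connection options are placeholders for my environment.
col_stats = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://<metastore-host>:3306/hive")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", "TAB_COL_STATS")
    .option("user", "<user>")
    .option("password", "<password>")
    .load())
col_stats.filter(col_stats.TABLE_NAME == "<table name>").show()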
How can I access these column-level statistics within Spark?
Edit: I have investigated this further, which allows me to refine the scope of my question:
It looks like my Hive client (Hive View 2.0) and Spark SQL do not write the statistics to the same location.
When using Spark SQL's ANALYZE TABLE method, only table statistics show up in the Statistics key of the CatalogTable:
Statistics: sizeInBytes=15196959528, rowCount=39763901, isBroadcastable=false
However, Hive View is blind to these stats, which do not appear in its table statistics.
Conversely, when computing table or column statistics within Hive View, I can collect the table statistics with Spark SQL's DESCRIBE EXTENDED method, but they appear in the Properties key of my CatalogTable, which also indicates whether some column statistics have been computed:
Properties: [numFiles=20, transient_lastDdlTime=1540381765, totalSize=15196959528, COLUMN_STATS_ACCURATE={"COLUMN_STATS":{"bubble_level":"true","bubble_level_n1s":"true","timestamp":"true"}}]
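For what it's worth, these Hive-written entries can also be listed as plain table properties (a sketch; <table name> is a placeholder):

# numFiles, totalSize, COLUMN_STATS_ACCURATE, etc. show up here as well.
spark.sql("SHOW TBLPROPERTIES <table name>").show(truncate=False)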
Thus these two sets of information appear to be independent, and my question becomes: which of them can Spark actually use to optimize the execution plan? I understand that some of these questions could be solved by upgrading to the latest version of Spark, but that is not on my schedule for the moment.
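To see what the optimizer itself picks up, I have been probing the optimized plan's statistics through py4j; this is a sketch that relies on Spark 2.1 internals (the accessor was renamed in later versions), so it may not survive an upgrade:

# Inspect the statistics attached to the optimized logical plan;
# sizeInBytes is what drives e.g. broadcast-join decisions.
df = spark.table("<table name>")
stats = df._jdf.queryExecution().optimizedPlan().statistics()
print(stats.sizeInBytes())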
The above-mentioned answer by Jacek Laskowski suggests that Hive's statistics can be used as long as they are available through Spark SQL's DESCRIBE EXTENDED method.
Can anybody confirm this?
Many thanks in advance for helping me clear this up.