
I have the following questions on statistics collection on tables in Apache Spark:

  1. Where do the collected stats get stored? In the Metastore?
  2. In a system where Spark and Hive share a metastore, will the stats collected on a Hive table by a Hive application be made available to the Spark optimizer? Similarly, will the stats collected by Spark on a Hive table be made available to the Hive optimizer?
  3. Is it possible to force Spark to collect stats on a DataFrame loaded in memory, or to collect stats on a temporary table created from a DataFrame?
rogue-one
    see also https://stackoverflow.com/questions/39632724/how-does-computing-table-stats-in-hive-or-impala-speed-up-queries-in-spark-sql – Raphael Roth Oct 16 '18 at 05:34

1 Answer

  1. They are stored in the Hive Metastore, specifically as table properties. Formats such as ORC and Parquet also carry per-file and per-block statistics that a reader can use; however, those are not used by the optimizer.
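For example, you can collect and then inspect these statistics from Spark SQL (the database and table names here are illustrative):

```sql
-- Collect table-level statistics; Spark writes them to the Hive Metastore
-- as table properties (spark.sql.statistics.*).
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS;

-- Optionally collect per-column statistics as well (used by the
-- cost-based optimizer when spark.sql.cbo.enabled is true).
ANALYZE TABLE my_db.my_table COMPUTE STATISTICS FOR COLUMNS id, name;

-- Inspect the stored statistics in the table metadata.
DESCRIBE EXTENDED my_db.my_table;
```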

  2. Spark and Hive use different property names to store their statistics, so unfortunately neither engine can use the statistics collected by the other.

Specifically, after collecting statistics in Spark, the table properties contain:

TBLPROPERTIES (
  'numFiles'='1',
  'numRows'='-1',
  'rawDataSize'='-1',
  'spark.sql.statistics.numRows'='111111',
  'spark.sql.statistics.totalSize'='11111',
  'totalSize'='111111'
)

After collecting statistics in Hive, the table properties contain:

TBLPROPERTIES (
  'numFiles'='1',
  'numRows'='1111111',
  'rawDataSize'='1111111'
)
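For comparison, a minimal sketch of producing the Hive-side statistics shown above (run in Hive, not Spark; the table name is illustrative):

```sql
-- Run in Hive: updates numRows and rawDataSize in the Metastore.
ANALYZE TABLE my_table COMPUTE STATISTICS;

-- Per-column statistics; these are also stored in Hive-specific
-- metadata and are not picked up by Spark's optimizer.
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS;
```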
vrajat