Must run "analyze table" in hive else count(*) shows 0

Question

I am creating an external Hive table on a parquet file on S3. The commands look like

CREATE EXTERNAL TABLE userinfo(
  user_id string,
  last_name string,
  first_name string
)
PARTITIONED BY (
  yr string,
  mo string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://mybucket/basedir/'
TBLPROPERTIES (
  'serialization.null.format'='');

alter table userinfo add IF NOT EXISTS partition (yr='2021', mo='07');

At this point, if I run "select count(*) from userinfo", I get 0 as the result. But if I then run

ANALYZE TABLE userinfo PARTITION(yr='2021', mo='07') COMPUTE STATISTICS;

and rerun the "select count(*)..." I get the expected row count.

This isn't a show-stopper, but it makes me think I'm doing something/failing to do something that's causing this strange behavior. Any insights are welcome.

See also: https://stackoverflow.com/a/39914232/2700344 – leftjoin Jul 21 '21 at 05:42 — leftjoin, Jul 21 '21 at 05:42

score 1 · Accepted Answer · answered Jul 20 '21 at 17:19

You are not telling optimizer the statistics of the table so it shows 0 because it doesnt know anything about the data in it.
Ideally you should gather statistics whenever you load data into a table/partition. Stale statistics or no statistics will keep optimizer thinking/guessing and thus making queries run longer/consume more memory and sometime return wrong values.
Now, you can setup this property in hive for auto stats gathering -

hive.stats.autogather=true

Or you can gather them manually right after loading data into it.

ANALYZE TABLE tab COMPUTE STATISTICS;

Must run "analyze table" in hive else count(*) shows 0

1 Answers1