msck repair table not working on unpartitioned table - hive config issue

Question

I have an unpartitioned EXTERNAL table:

CREATE EXTERNAL TABLE `db.tableName`(
  `sid` string,
  `uid` int,
  `t1` timestamp,
  `t2` timestamp)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<db_location>/tableName'
TBLPROPERTIES (
  'serialization.null.format'='',
  'transient_lastDdlTime'='1551121065')

When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero. There are 10 rows in the CSV and I expect to get the count back as 10.

Any help is appreciated.

It can be an issue with statistics: https://stackoverflow.com/a/39914232/2700344 And you do not need REPAIR on unpartitioned table, REPAIR adds partitions which exist only on S3, but missing in Hive metadata — leftjoin, Apr 15 '19 at 19:08
Thank you ANALYZE TABLE db.tableName COMPUTE STATISTICS FOR COLUMNS; This worked. But why does this need to be run? — dreddy, Apr 15 '19 at 19:55
because this property is set hive.compute.query.using.stats=true; and statistics is stale after loading file. You need analyze after each load if you want fast count work. Or disable it set hive.compute.query.using.stats=false; Then it will start map-reduce and will work slow — leftjoin, Apr 15 '19 at 20:01
Possible duplicate of [HIVE select count(\*) non null returns higher value than select count(\*)](https://stackoverflow.com/questions/39912784/hive-select-count-non-null-returns-higher-value-than-select-count) — leftjoin, Apr 15 '19 at 20:04

msck repair table not working on unpartitioned table - hive config issue

0 Answers0