0

I have an unpartitioned EXTERNAL table:

CREATE EXTERNAL TABLE `db.tableName`(
  `sid` string,
  `uid` int,
  `t1` timestamp,
  `t2` timestamp)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'=',',
  'serialization.format'=',')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<db_location>/tableName'
TBLPROPERTIES (
  'serialization.null.format'='',
  'transient_lastDdlTime'='1551121065')

When I copy the file tableName.csv to s3://db_location/tableName/tableName.csv and then run msck repair table db.tableName, I get the count back as zero. There are 10 rows in the CSV and I expect to get the count back as 10.

Any help is appreciated.

dreddy
  • 463
  • 1
  • 7
  • 21
  • It can be an issue with statistics: https://stackoverflow.com/a/39914232/2700344 And you do not need REPAIR on unpartitioned table, REPAIR adds partitions which exist only on S3, but missing in Hive metadata – leftjoin Apr 15 '19 at 19:08
  • Thank you ANALYZE TABLE db.tableName COMPUTE STATISTICS FOR COLUMNS; This worked. But why does this need to be run? – dreddy Apr 15 '19 at 19:55
  • because this property is set hive.compute.query.using.stats=true; and statistics is stale after loading file. You need analyze after each load if you want fast count work. Or disable it set hive.compute.query.using.stats=false; Then it will start map-reduce and will work slow – leftjoin Apr 15 '19 at 20:01
  • Possible duplicate of [HIVE select count(\*) non null returns higher value than select count(\*)](https://stackoverflow.com/questions/39912784/hive-select-count-non-null-returns-higher-value-than-select-count) – leftjoin Apr 15 '19 at 20:04

0 Answers0