8

I am currently doing some data exploration with Hive and cannot explain the following behavior. Say I have a table (named mytable) with a field master_id.

When I count the number of row I get

select count(*) as c from mytable 
c
1129563

If I want to count the number of row with a non null master_id, I get a higher number

select count(*) as c from mytable where master_id is not null
c
1134041

Additionally, the master_id seems to be never null.

select count(*) as c from mytable where master_id is null
c
0

I cannot explain how adding a where statement can increase the number of rows eventually. Does anyone have any hint to explain this behavior ?

Thanks

leftjoin
  • 36,950
  • 8
  • 57
  • 116
z_eb
  • 93
  • 1
  • 6

1 Answers1

10

Most probably your query without where is using statistics because of this parameter is set:

set hive.compute.query.using.stats=true;

Try to set it false and execute again.

Alternatively you can compute statistics on the table. See ANALYZE TABLE SYNTAX

Also it's possible to gather statistics during INSERT OVERWRITE automatically:

set hive.stats.autogather=true;
leftjoin
  • 36,950
  • 8
  • 57
  • 116