Row count results MISMATCH for Select * and Select count(1) for Hive External table for Big Files

Question

I am running hive external table queries. Issue: 'Select * from table1' row count which hive shows is different 'Select count(*) from table1'. It should match but not matching not sure why? Result match for small data 20 MB or so but not for Big table i.e 600 MB they do not match..Any one has faced this issue ??

Below are some queries I ran to show the result. My source file is RDS file which I convert to csv file and upload it to HDFS and create external table.

additional details

Note: I only face this issue for big files e.g. size 200 MB or more but for small files e.g 80 MB there is no ussue.

SELECT count(*) FROM dbname1.cy_tablet where Ranid Is NULL # Zero results

How do you know the count of rows when you do `select * from ...`? — GMB, Sep 10 '20 at 10:20
@GMB it shows when you run query in Hive command line Interface. It shows number of rows and time taken to fetch them.I have included same in screenshot — Tokci, Sep 10 '20 at 10:53
Probably this figures are just estimates, hence the different results when there are lots of rows. `select count(*)` gives you the real count. — GMB, Sep 10 '20 at 10:57
@GMB When I do 'Select ranid from dbname1.cy_tablet;' then also row count is different. — Tokci, Sep 10 '20 at 11:14
See this answer: https://stackoverflow.com/a/39914232/2700344 — leftjoin, Sep 10 '20 at 11:15
@leftjoin I tried all steps except 'Analyze Table syntax' as it is throwing permission error BUT my results remain same as mentioned in question screenshot for oher 2 steps. First set hive.compute.query.using.stats=False; ran all queries ,no change in result,then did set hive.stats.autogather=true; and ran all queries no change in results. — Tokci, Sep 10 '20 at 12:03
@mazaneicha csv format and one file for a External Hive table — Tokci, Sep 11 '20 at 05:06
I would check if you have empty extra line at the end of your .csv. That would explain why line count (returned by `count(*)` doesnt match the count when you try to convert lines to actual fields (`select ranid...`). — mazaneicha, Sep 11 '20 at 13:36
@mazaneicha I get zero rows for SELECT count(*) FROM dbname1.cy_tablet where Ranid Is NULL — Tokci, Sep 11 '20 at 17:36
can you just run a `$ grep -cv "^$" your_file.csv` to check for empty lines? — mazaneicha, Sep 11 '20 at 18:21
@mazaneicha above command returned one row 1005501. I then cleaned it using grep -v '^$' yourfile.csv > yourfile_fixed.csv and again ran $ grep -cv "^$" your_file.csv got 1005501 again — Tokci, Sep 12 '20 at 10:59
@mazaneicha I face this issue for Big files e.g 200 MB and not small file 80 MB. Sometimes differences is of 5 — Tokci, Sep 12 '20 at 11:59
@mazaneicha I viewed file in excel it shows no blank rows in very last — Tokci, Sep 12 '20 at 12:08

score 0 · Accepted Answer · answered Oct 02 '20 at 14:41

We resolved the issue and all count match now. We removed headers in our csv file used as source to Hive External tables, by using col_names = FALSE

write_delim(df_data,delim = "|",col_names = FALSE, output_file)#

Removed following line from CREATE EXTERNAL TABLE command

TBLPROPERTIES('skip.header.line.count'='1'

Above steps resolved our issue.

The issue was happening in big files. In our site HDFS block size is 128MB, if we divide file size by 128MB gives us a number , I was getting same as difference. So I think the issue was with headers.

Note: We used pipe '|' as delimiter as we faced some other issues when using ','

Row count results MISMATCH for Select * and Select count(1) for Hive External table for Big Files

1 Answers1