0

I am running hive external table queries. Issue: 'Select * from table1' row count which hive shows is different 'Select count(*) from table1'. It should match but not matching not sure why? Result match for small data 20 MB or so but not for Big table i.e 600 MB they do not match..Any one has faced this issue ??

Below are some queries I ran to show the result. My source file is RDS file which I convert to csv file and upload it to HDFS and create external table. enter image description here

additional details

Note: I only face this issue for big files e.g. size 200 MB or more but for small files e.g 80 MB there is no ussue.

SELECT count(*) FROM dbname1.cy_tablet where Ranid Is NULL # Zero results
Tokci
  • 1,220
  • 1
  • 23
  • 31
  • How do you know the count of rows when you do `select * from ...`? – GMB Sep 10 '20 at 10:20
  • @GMB it shows when you run query in Hive command line Interface. It shows number of rows and time taken to fetch them.I have included same in screenshot – Tokci Sep 10 '20 at 10:53
  • Probably this figures are just estimates, hence the different results when there are lots of rows. `select count(*)` gives you the real count. – GMB Sep 10 '20 at 10:57
  • @GMB When I do 'Select ranid from dbname1.cy_tablet;' then also row count is different. – Tokci Sep 10 '20 at 11:14
  • 2
    See this answer: https://stackoverflow.com/a/39914232/2700344 – leftjoin Sep 10 '20 at 11:15
  • @leftjoin I tried all steps except 'Analyze Table syntax' as it is throwing permission error BUT my results remain same as mentioned in question screenshot for oher 2 steps. First set hive.compute.query.using.stats=False; ran all queries ,no change in result,then did set hive.stats.autogather=true; and ran all queries no change in results. – Tokci Sep 10 '20 at 12:03
  • What data format does your table use? – mazaneicha Sep 10 '20 at 19:18
  • @mazaneicha csv format and one file for a External Hive table – Tokci Sep 11 '20 at 05:06
  • 1
    I would check if you have empty extra line at the end of your .csv. That would explain why line count (returned by `count(*)` doesnt match the count when you try to convert lines to actual fields (`select ranid...`). – mazaneicha Sep 11 '20 at 13:36
  • @mazaneicha I get zero rows for SELECT count(*) FROM dbname1.cy_tablet where Ranid Is NULL – Tokci Sep 11 '20 at 17:36
  • can you just run a `$ grep -cv "^$" your_file.csv` to check for empty lines? – mazaneicha Sep 11 '20 at 18:21
  • @mazaneicha above command returned one row 1005501. I then cleaned it using grep -v '^$' yourfile.csv > yourfile_fixed.csv and again ran $ grep -cv "^$" your_file.csv got 1005501 again – Tokci Sep 12 '20 at 10:59
  • @mazaneicha I face this issue for Big files e.g 200 MB and not small file 80 MB. Sometimes differences is of 5 – Tokci Sep 12 '20 at 11:59
  • @mazaneicha I viewed file in excel it shows no blank rows in very last – Tokci Sep 12 '20 at 12:08

1 Answers1

0

We resolved the issue and all count match now. We removed headers in our csv file used as source to Hive External tables, by using col_names = FALSE

write_delim(df_data,delim = "|",col_names = FALSE, output_file)#

Removed following line from CREATE EXTERNAL TABLE command

TBLPROPERTIES('skip.header.line.count'='1'

Above steps resolved our issue.

The issue was happening in big files. In our site HDFS block size is 128MB, if we divide file size by 128MB gives us a number , I was getting same as difference. So I think the issue was with headers.

Note: We used pipe '|' as delimiter as we faced some other issues when using ','

Tokci
  • 1,220
  • 1
  • 23
  • 31