Results of rdd.count, count via spark sql are the same, but they are different from count with hive sql

Question

I call count on an RDD and get 13673153. After I convert the RDD to a DataFrame, insert it into a Hive table, and count again, I get 13673182. Why?

rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...

It can also be an issue with statistics in Hive: https://stackoverflow.com/a/39914232/2700344 – leftjoin, Aug 15 '19 at 14:36

score 0 · Answer 1 · answered Aug 15 '19 at 14:12

This could be caused by a mismatch between data in the underlying files and the metadata registered in hive for that table. Try running:

MSCK REPAIR TABLE tablename;

in Hive, and see if the issue is fixed. The command updates the partition information registered for the table. You can find more info in the Hive documentation.

score 0 · Answer 2 · thebluephantom · 2019-08-16T07:45:02.497

When Spark runs an Action, it records, as part of the SparkContext, which files were in scope for processing. So if the DAG needs to recover and reprocess that Action, it produces the same results. This is by design.

Hive QL has no such considerations.

UPDATE

As you noted, the other answer did not help in this use case.

So, when Spark processes Hive tables it looks at the list of files that it will use for the Action.

In the case of a failure (node failure, etc.) it will recompute data from the generated DAG. If it needs to go back and recompute as far as the initial read from Hive itself, it knows which files to use, i.e. the same files, so that it produces the same results instead of non-deterministic outcomes. Think of the partitioning aspects, for example: it is handy that the same results can be recomputed.

It's that simple. It's by design. Hope this helps.

edited Aug 16 '19 at 07:45

answered Aug 15 '19 at 16:30

thebluephantom


Updated answer. – thebluephantom Aug 16 '19 at 06:39
Thanks! I understand what you are saying, but I think it doesn't explain the reason for the difference. I changed the Hive table's storage format from text file to Parquet, and it worked. Maybe it has something to do with the separator '\t', because the fields contain '\t'. – Wloverine Aug 19 '19 at 06:47
I am not sure I can follow, but what I state is a basic premise of the Spark DAG approach. Success. – thebluephantom Aug 19 '19 at 07:32
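
Wloverine's comment above reports that switching the table from text file to Parquet made the counts agree. One way a delimited text format can change a count is sketched below in plain Python, with no Spark or Hive involved. The rows and the helper names write_text_table and read_text_table are hypothetical, and the thread does not confirm that this is what happened to the asker's data. The sketch assumes a text table with no quoting or escaping that is read back one record per line, which is how a newline-terminated text format behaves.

```python
# Hypothetical sample rows: one field contains a newline, another contains a tab.
rows = [
    ("1", "alpha"),
    ("2", "line one\nline two"),
    ("3", "tab\tinside"),
]


def write_text_table(rows, field_sep="\t", line_sep="\n"):
    """Serialize rows as plain delimited text, with no quoting or escaping."""
    return line_sep.join(field_sep.join(fields) for fields in rows)


def read_text_table(data, field_sep="\t", line_sep="\n"):
    """Re-read the text the way a line-oriented reader does: one record per line."""
    return [line.split(field_sep) for line in data.split(line_sep)]


original_count = len(rows)
reread = read_text_table(write_text_table(rows))
reread_count = len(reread)
widths = [len(record) for record in reread]

# Compare the in-memory count with the count after the text round trip,
# and show how many fields each re-read record ended up with.
print(original_count, reread_count, widths)
```

In this sketch the embedded newline splits one row into two records, so the re-read count is higher than the original, while the embedded tab only shifts columns and leaves the row count alone. A columnar binary format such as Parquet does not use field or line delimiters, so neither character can split a value. To check real data, one option is to count the rows whose string fields contain '\n' or '\r' before writing them out.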
