
I am using Spark 2.4.4 and Hive 2.3.

Using Spark, I am loading a DataFrame into a Hive table using DF.insertInto(hiveTable).

If the table is created during the run (of course before the insertInto, via spark.sql), or if it is an existing table that was created by Spark 2.4.4, everything works fine.

The issue is with loading some existing tables (older tables created with Spark 2.2 or earlier): the record COUNT of the target Hive table differs when it is queried through Beeline vs. Spark SQL.
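
Roughly what I am doing (a minimal sketch in Scala; the DataFrame source and table name are hypothetical):

    // Spark 2.4.4, Hive 2.3
    val df = spark.table("staging.source_data")              // hypothetical source DataFrame

    // Target table is either created in the same run via spark.sql(...)
    // or already exists (possibly created by Spark 2.2 or earlier)
    df.write.insertInto("db.target_hive_table")              // hypothetical target table

    // Spark SQL count of the target table...
    spark.sql("SELECT COUNT(*) FROM db.target_hive_table").show()
    // ...which, for the older tables, does not match the same COUNT(*) run through Beeline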

Please assist.

VimalK
  • It can be an issue with stale statistics (sketched below): https://stackoverflow.com/a/39914232/2700344 – leftjoin Feb 20 '22 at 09:20
  • Found the solution for the issue. There seems to be an issue in the sync between the Hive Metastore & the Spark catalog for tables created before Spark 2.2 and updated using Spark 2.4.4 - especially on the MapR-FS filesystem. In the general case, spark.catalog.refreshTable will update the catalog after any change to the Hive table & the stats are refreshed. – VimalK Apr 19 '22 at 06:56
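
If stale statistics are indeed the cause, recomputing them is one way to check; a minimal sketch, assuming a hypothetical table name:

    // Recompute table-level statistics so COUNT(*) is not answered from stale stats
    spark.sql("ANALYZE TABLE db.target_hive_table COMPUTE STATISTICS")

    // Then invalidate Spark's cached metadata for the table
    spark.catalog.refreshTable("db.target_hive_table")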

1 Answer


There seems to be an issue with the sync between the Hive Metastore and the Spark catalog for Hive tables (in Parquet file format, with complex/nested data types) created on Spark 2.2 or before and loaded using Spark 2.4.

In the usual case, spark.catalog.refreshTable(<hive-table-name>) will refresh the stats from the Hive Metastore into the Spark catalog.

In this case, an explicit spark.catalog.refreshByPath(<location-maprfs-path>) needs to be executed to refresh the stats.
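
A minimal sketch of the two calls (table name and MapR-FS location are hypothetical):

    // Usual case: refresh cached metadata and stats for the table by name
    spark.catalog.refreshTable("db.target_hive_table")

    // For the older tables described above, refresh by the table's storage path instead
    spark.catalog.refreshByPath("maprfs:///user/hive/warehouse/db.db/target_hive_table")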

VimalK