
I have a Hive source table with the following row count:

select count(*) from dev_lkr_send.pz_send_param_ano;
-- 25283 rows

I am trying to load all of the table's rows into a DataFrame using Spark 2 (Scala). I did the following:

val dfMet = spark.sql(s"""SELECT
    CD_ANOMALIE,
    CD_FAMILLE,
    libelle AS LIB_ANOMALIE,
    to_date(substr(MAJ_DATE, 1, 19), 'yyyy-MM-dd HH:mm:ss') AS DT_MAJ,
    CLASSIFICATION,
    NB_REJEUX,
    case when indic_cd_erreur = 'O' then 1 else 0 end AS TOP_INDIC_CD_ERREUR,
    case when invalidation_coordonnee = 'O' then 1 else 0 end AS TOP_COORDONNEE_INVALIDE,
    case when typ_mvt = 'S' then 1 else 0 end AS TOP_SUPP,
    case when typ_mvt = 'S' then to_date(substr(dt_capt, 1, 19), 'yyyy-MM-dd HH:mm:ss') else null end AS DT_SUPP
FROM ${use_database}.pz_send_param_ano""")

When I execute dfMet.count() it returns: 46314

Any ideas about the source of the difference?


EDIT1:

Trying the same query from Hive returns the same value as the DataFrame (I was querying from the Impala UI before).

Can someone explain the difference, please? I am working on Hue 4.
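One way to narrow this down (a sketch, assuming you can run HiveQL against the same table) is to compare a plain count with one that forces a genuine scan; a trivial non-partition predicate is generally enough to bypass any statistics-based shortcut:

```sql
-- A plain COUNT(*) may be answered from metastore statistics
SELECT COUNT(*) FROM dev_lkr_send.pz_send_param_ano;

-- A trivial predicate forces a real scan of the data files,
-- so this count reflects the rows Spark will actually read
SELECT COUNT(*) FROM dev_lkr_send.pz_send_param_ano WHERE 1 = 1;
```

If the two numbers disagree, the fast count is coming from stale metadata rather than the data itself.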

Haha
  • Are you sure `use_database` is the same as `dev_lkr_send`? Also, is this a metastore table or a temporary table? – gorros Jul 12 '19 at 08:22
  • Yes, the use_database value is correct, I checked it. Actually I am loading values from Hive. I use the following to connect: val spark = SparkSession.builder().appName("AB Test").enableHiveSupport().getOrCreate() – Haha Jul 12 '19 at 08:28
  • Is `select count(*) from dev_lkr_send.pz_send_param_ano;` done also via Spark SQL? – gorros Jul 12 '19 at 08:30
  • No. I execute it in Hive to check my source size. In Spark SQL I execute dfMet.count() to verify whether my DF loaded all of my data – Haha Jul 12 '19 at 08:33
  • In Hive you have only file locations, not the actual data. Try `select * from dev_lkr_send.pz_send_param_ano` – gorros Jul 12 '19 at 08:34
  • I don't understand. Doesn't `select count(*) from dev_lkr_send.pz_send_param_ano;` return the real number of rows? How can I tell how much data to expect in Spark then? – Haha Jul 12 '19 at 08:35
  • I thought you were executing the query by connecting directly to the Hive metastore, but now I understand that you connect to Hive. Maybe there is an issue with partitions; try to repair it with `MSCK REPAIR TABLE tablename`. – gorros Jul 12 '19 at 08:44
  • Read also this: https://stackoverflow.com/a/39914232/2700344 – leftjoin Jul 12 '19 at 19:00

1 Answer


A potential source of the difference is that your Hive query is returning a stale result from the metastore, rather than running a fresh count against the table.

If hive.compute.query.using.stats is set to true and the table has statistics computed, Hive will answer COUNT(*) from the metastore. If that is the case here, your statistics may be out of date and you need to recompute them.
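As a concrete sketch (assuming a Hive shell or beeline session against the same table), you can inspect the setting, disable the stats shortcut for the session, and refresh the statistics:

```sql
-- Show the current value of the setting
SET hive.compute.query.using.stats;

-- Disable the stats-based shortcut for this session and re-run the count;
-- if the number changes, the metastore statistics were stale
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM dev_lkr_send.pz_send_param_ano;

-- Recompute the statistics so future stats-based answers are correct
ANALYZE TABLE dev_lkr_send.pz_send_param_ano COMPUTE STATISTICS;
```

For a partitioned table, running `MSCK REPAIR TABLE` first (as suggested in the comments) makes sure all partitions are registered before the statistics are recomputed.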

Mark
  • Where can I find hive.compute.query.using.stats? And what does recomputing stats mean? – Haha Jul 12 '19 at 10:32
  • In the Hive shell, if you run `SET;` it will list the configuration, and hive.compute.query.using.stats should be listed in the output. Can you tell me if you use just Hive, or Hive and Impala? – Mark Jul 12 '19 at 13:13