I have a Hive source table with the following row count:
select count(*) from dev_lkr_send.pz_send_param_ano;
-- 25283 rows
I am trying to load all rows of the table into a DataFrame using Spark 2 with Scala. I did the following:
val dfMet = spark.sql(s"""SELECT
CD_ANOMALIE,
CD_FAMILLE,
libelle AS LIB_ANOMALIE,
to_date(substr(MAJ_DATE, 1, 19), 'YYYY-MM-DD HH24:MI:SS') AS DT_MAJ,
CLASSIFICATION,
NB_REJEUX,
case when indic_cd_erreur = 'O' then 1 else 0 end AS TOP_INDIC_CD_ERREUR,
case when invalidation_coordonnee = 'O' then 1 else 0 end AS TOP_COORDONNEE_INVALIDE,
case when typ_mvt = 'S' then 1 else 0 end AS TOP_SUPP,
case when typ_mvt = 'S' then to_date(substr(dt_capt, 1, 19), 'YYYY-MM-DD HH24:MI:SS') else null end AS DT_SUPP
FROM ${use_database}.pz_send_param_ano""")
When I execute dfMet.count(), it returns 46314.
Any ideas about the source of the difference?
EDIT1:
Running the same query from Hive returns the same value as the DataFrame (I was previously querying from the Impala UI).
Can someone explain the difference, please? I am working with Hue 4.
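
One check that might narrow this down is a plain row count through Spark, without the SELECT list used for dfMet. This is only a minimal sketch: the application name is hypothetical, and use_database is assumed to be dev_lkr_send, as in the count query at the top. It also rests on an unconfirmed assumption, namely that the Impala figure comes from cached table metadata that predates data loaded later through Hive.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() then returns it.
val spark = SparkSession.builder()
  .appName("count-check") // hypothetical application name
  .enableHiveSupport()
  .getOrCreate()

// Assumption: same database as in the count query at the top of the question.
val use_database = "dev_lkr_send"

// Plain row count through Spark, without the projection used for dfMet.
val sparkCount: Long = spark.table(s"$use_database.pz_send_param_ano").count()
println(s"Spark count: $sparkCount")

// If this matches dfMet.count() but not the Impala figure, the SELECT list
// is not the cause. A possible next step, run in the Impala editor (not Spark):
//   REFRESH dev_lkr_send.pz_send_param_ano;
//   SELECT count(*) FROM dev_lkr_send.pz_send_param_ano;
```

If the Impala count changes after the REFRESH, that would point to stale Impala metadata rather than to anything in the Spark query.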