
I have a Hive table created on top of S3 data in Parquet format, partitioned by a single column named eventdate.

1) When querying with Hive, it returns data for a column named "headertime", which is present in the schema of both the table and the file:

select headertime from dbName.test_bug where eventdate=20180510 limit 10

2) From a Scala notebook, directly loading the files from a particular partition also works:

val session = org.apache.spark.sql.SparkSession.builder
  .appName("searchRequests")
  .enableHiveSupport()
  .getOrCreate()

val searchRequest = session.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")

searchRequest.createOrReplaceTempView("SearchRequest")

val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")

exploreDF.show(20)

This also displays the values for the "headertime" column.

3) But when using Spark SQL to query the Hive table directly, as below,

val exploreDF = session.sql("select headertime from tier3_vsreenivasan.test_bug where eventdate=20180510 limit 100")

exploreDF.show(20)

it always returns null.

I opened the Parquet file and can see that the headertime column is present with values, but I'm not sure why Spark SQL is unable to read the values for that column.
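
One quick way to check what Spark infers from the files themselves is to print the schema of the direct Parquet read (a minimal diagnostic sketch reusing the partition path from step 2; the exact column casing in the output is the interesting part):

// Print the schema Spark infers straight from the Parquet footers,
// bypassing the Hive metastore entirely.
session.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510").printSchema()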

It would be helpful if someone could point out where Spark SQL gets its schema from; I was expecting it to behave like the Hive query.
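
For comparison, the schema Spark resolves through the Hive metastore can be printed the same way (a sketch assuming the table name from step 3; spark.sql.hive.convertMetastoreParquet is a real Spark setting, but that it changes the result here is only a guess):

// Print the schema Spark resolves for the Hive table via the metastore.
session.table("tier3_vsreenivasan.test_bug").printSchema()

// If the casing differs between the metastore schema (all lowercase) and
// the Parquet footers (e.g. headerTime), Spark's built-in Parquet reader
// can fail to match the column and return nulls; forcing the Hive SerDe
// read path instead is one way to test that hypothesis:
session.sql("SET spark.sql.hive.convertMetastoreParquet=false")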

