
I have a Hive table created on top of S3 data in Parquet format, partitioned by a single column named eventdate.

1) When querying with Hive, it returns data for a column named "headertime", which is present in the schema of both the table and the file:

select headertime from dbName.test_bug where eventdate=20180510 limit 10

2) From a Scala notebook, directly loading the files from a particular partition also works:

val session = org.apache.spark.sql.SparkSession.builder
  .appName("searchRequests")
  .enableHiveSupport()
  .getOrCreate()

val searchRequest = session.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")

searchRequest.createOrReplaceTempView("SearchRequest")

val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")

exploreDF.show(20)

This also displays the values for the "headertime" column.

3) But when using Spark SQL to query the Hive table directly, as below,

val exploreDF = session.sql("select headertime from tier3_vsreenivasan.test_bug where eventdate=20180510 limit 100")

exploreDF.show(20)

it always returns null.

I opened the Parquet file and can see that the headertime column is present with values, but I'm not sure why Spark SQL is unable to read the values for that column.
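
One quick way to check what Spark infers from the files themselves is to print the schema of the direct Parquet read (a minimal diagnostic sketch reusing the partition path from step 2; the exact column casing in the output is the interesting part):

// Print the schema Spark infers straight from the Parquet footers,
// bypassing the Hive metastore entirely.
session.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510").printSchema()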

It would be helpful if someone could point out where Spark SQL gets its schema from; I was expecting it to behave like the Hive query.
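
For comparison, the schema Spark resolves through the Hive metastore can be printed the same way (a sketch assuming the table name from step 3; spark.sql.hive.convertMetastoreParquet is a real Spark setting, but that it changes the result here is only a guess):

// Print the schema Spark resolves for the Hive table via the metastore.
session.table("tier3_vsreenivasan.test_bug").printSchema()

// If the casing differs between the metastore schema (all lowercase) and
// the Parquet footers (e.g. headerTime), Spark's built-in Parquet reader
// can fail to match the column and return nulls; forcing the Hive SerDe
// read path instead is one way to test that hypothesis:
session.sql("SET spark.sql.hive.convertMetastoreParquet=false")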

