
I have created an external table in Qubole (Hive) which reads parquet (compressed: snappy) files from S3, but on running a SELECT * on the table I am getting null values for all columns except the partitioned column.

I tried using different serialization.format values in SERDEPROPERTIES, but I am still facing the same issue. On removing the property 'serialization.format' = '1' I get ERROR: Failed with exception java.io.IOException:Can not read value at 0 in block -1 in file s3://path_to_parquet/.

I checked the parquet files and was able to read the data using parquet-tools:

**file_01.snappy.parquet:**
{"col_2":1234,"col_3":ABC}
{"col_2":124,"col_3":FHK}
{"col_2":12515,"col_3":UPO}


**External table stmt:**
CREATE EXTERNAL TABLE parquet_test
(
    col2 int,
    col3 string
)
PARTITIONED BY (col1 date) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1'
)
STORED AS PARQUET
LOCATION 's3://path_to_parquet'
TBLPROPERTIES ('parquet.compress'='SNAPPY');

Result:
col_1   col_2   col_3
5/3/19  NULL    NULL
5/4/19  NULL    NULL
5/5/19  NULL    NULL
5/6/19  NULL    NULL

Expected Result:
col_1   col_2   col_3
5/3/19  1234    ABC
5/4/19  124     FHK
5/5/19  12515   UPO
5/6/19  1234    ABC
S.Mehra
  • Hi, can you try it without specifying the row format serde (delete the line with the row format serde)? If that doesn't work, try again with ```ROW FORMAT DELIMITED FIELDS TERMINATED BY ','``` – F.Lazarescu May 16 '19 at 07:59
  • Hey, also can you please check the schema of the parquet file, as in the Hive table `col_3` should be a `string`. Please do verify the schema and then recreate the Hive table to check if the load is proper. – DataWrangler May 16 '19 at 09:10
  • Hey @F.Lazarescu, thanks, but I tried removing the row format serde and also adding `ROW FORMAT DELIMITED FIELDS TERMINATED BY ','`, and it gave me the same error in both cases :( – S.Mehra May 16 '19 at 15:54
  • Hey @Joby, I checked the schema and updated it in the question as well; the earlier one was a mistake. Thanks – S.Mehra May 16 '19 at 15:56

1 Answer


Writing the below answer assuming that the table was created using Hive and read using Spark (since the question is tagged with apache-spark-sql).

How was the data created?

Spark supports case-sensitive schemas. When we use the DataFrame API, it is possible to write data with a case-sensitive schema.
Example:

scala> case class Employee(iD: Int, NaMe: String)
defined class Employee
scala> spark.range(10).map(x => Employee(x.toInt, s"name$x")).write.save("file:///tmp/data/")
scala> spark.read.parquet("file:///tmp/data/").printSchema
root
 |-- iD: integer (nullable = true)
 |-- NaMe: string (nullable = true)

Notice that in the above example the case of the column names is preserved.
When we create a Hive table on top of the data written by Spark, Hive is able to read it correctly, since it is not case sensitive.
Whereas when the same data is read using Spark, it uses the schema from Hive, which is lowercase by default, and the rows returned are null.
To overcome this, Spark introduced the config spark.sql.hive.caseSensitiveInferenceMode.

object HiveCaseSensitiveInferenceMode extends Enumeration {
  val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value
}

val HIVE_CASE_SENSITIVE_INFERENCE = buildConf("spark.sql.hive.caseSensitiveInferenceMode")
  .doc("Sets the action to take when a case-sensitive schema cannot be read from a Hive " +
    "table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file " +
    "formats such as Parquet are. Spark SQL must use a case-preserving schema when querying " +
    "any table backed by files containing case-sensitive field names or queries may not return " +
    "accurate results. Valid options include INFER_AND_SAVE (the default mode-- infer the " +
    "case-sensitive schema from the underlying data files and write it back to the table " +
    "properties), INFER_ONLY (infer the schema but don't attempt to write it to the table " +
    "properties) and NEVER_INFER (fallback to using the case-insensitive metastore schema " +
    "instead of inferring).")
  .stringConf
  .transform(_.toUpperCase(Locale.ROOT))
  .checkValues(HiveCaseSensitiveInferenceMode.values.map(_.toString))
  .createWithDefault(HiveCaseSensitiveInferenceMode.INFER_AND_SAVE.toString)

INFER_AND_SAVE - Spark infers the case-sensitive schema from the data files and stores it in the metastore as part of the table's TBLPROPERTIES (desc extended <table name> should reveal this).
If the value of the property is NOT INFER_AND_SAVE or INFER_ONLY, then Spark uses the schema from the metastore table and will not be able to read the parquet files correctly. The default value of the property is INFER_AND_SAVE since Spark 2.2.0.
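
As a quick check from spark-shell (a sketch only; it assumes a Hive-enabled SparkSession and reuses the table name from the question):

// If INFER_AND_SAVE has run for this table, the inferred case-sensitive schema
// typically shows up under spark.sql.sources.schema* keys in the table properties.
spark.sql("DESC EXTENDED parquet_test").show(100, truncate = false)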

We could check the following to see if the problem is related to schema case sensitivity (a spark-shell sketch of these steps follows the list):
1. The value of spark.sql.hive.caseSensitiveInferenceMode (spark.sql("set spark.sql.hive.caseSensitiveInferenceMode") should reveal this).
2. Whether the data was created using Spark.
3. If 2 is true, whether the schema is case sensitive (spark.read.parquet(<location>).printSchema).
4. If 3 uses a case-sensitive schema and the output from 1 is not INFER_AND_SAVE/INFER_ONLY, run spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE"), drop the table, recreate the table, and try to read the data from Spark.
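
A rough spark-shell walkthrough of the steps above (a sketch only; the S3 path and table name are the ones from the question, and the recreate step is left as a placeholder):

// Step 1: current inference mode for this session.
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode").show(truncate = false)

// Step 3: read the files directly and check whether the schema keeps mixed case.
spark.read.parquet("s3://path_to_parquet/").printSchema

// Step 4: switch to INFER_AND_SAVE, then drop and recreate the table before reading again.
spark.sql("SET spark.sql.hive.caseSensitiveInferenceMode=INFER_AND_SAVE")
spark.sql("DROP TABLE IF EXISTS parquet_test")
// ... recreate the table with the same DDL as in the question ...
spark.table("parquet_test").show()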

DaRkMaN