
I used Spark to write the same data to Parquet in two ways: once directly, without Hive, and once through Hive. This is how I write directly, without Hive:

cube_op.sort("asn").write.parquet("/home/hadoop/work/aaa/agg1")
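
For reference, this is roughly how I check the stats (a minimal sketch using pyarrow, assuming the files are locally readable; the path and glob pattern match the write above):

    # Sketch: dump per-column min/max statistics from the written Parquet files.
    # Assumes pyarrow is installed and the output directory is locally accessible.
    import glob
    import pyarrow.parquet as pq

    for path in glob.glob("/home/hadoop/work/aaa/agg1/part-*.parquet"):
        meta = pq.ParquetFile(path).metadata
        for rg in range(meta.num_row_groups):
            group = meta.row_group(rg)
            for i in range(group.num_columns):
                col = group.column(i)
                stats = col.statistics
                if stats is not None and stats.has_min_max:
                    print(path, col.path_in_schema, stats.min, stats.max)
                else:
                    print(path, col.path_in_schema, "no min/max stats")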

I see that min/max stats are present for all columns. However, when I run the same sort and write through Hive, I do not see min/max stats for the string columns. Here is how I write using Hive:

cube_op.sort("asn").write.insertInto("tbl1")

These are the properties I have set:

spark.sql("SET spark.sql.parquet.binaryAsString=true")
"parquet.strings.signed-min-max.enabled": "true"

I am not sure what could be the reason for the difference. This is the Hive version:

hive --version
...
Hive 1.1.0-cdh5.16.2

This is apparently the Parquet version used on the Hive side, taken from the file footer:

creator: parquet-mr version 1.6.0 (build 6aa21f877662518059cfebe7f2e00cb) 
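
This is roughly how the writer version can be pulled from both sets of footers for comparison (a sketch using pyarrow; the Hive warehouse path below is a placeholder for wherever tbl1's files actually live):

    # Sketch: compare the parquet-mr writer recorded in the file footers of
    # the direct write vs. the Hive write.
    import glob
    import pyarrow.parquet as pq

    def writer_versions(pattern):
        return {pq.ParquetFile(p).metadata.created_by for p in glob.glob(pattern)}

    print(writer_versions("/home/hadoop/work/aaa/agg1/part-*.parquet"))
    print(writer_versions("/user/hive/warehouse/tbl1/*"))  # placeholder path; copy locally if on HDFS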
  • What is the Spark version and the Parquet dependency version? Is it spark-1.6.0+cdh5.16.2? – Ram Ghadiyaram Mar 29 '20 at 06:33
  • Does this answer your question? [PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)](https://stackoverflow.com/questions/53158121/pyspark-write-parquet-binary-column-with-stats-signed-min-max-enabled) – mazaneicha Mar 29 '20 at 14:51
  • This could be due to the Parquet version. I see a difference between the two Parquet versions, i.e. when I write directly vs. when I write via Hive. – CSUNNY Mar 30 '20 at 18:52
  • I did try the approach in that Stack Overflow post, but it did not work. Thanks. – CSUNNY Mar 30 '20 at 18:53

0 Answers