I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686 which is marked as resolved for parquet-mr
1.8.2. The feature I want is the calculated min/max
in the parquet metadata for a (string
or BINARY
) column.
And referencing this is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ@mail.gmail.com%3E
which uses scala
instead of pyspark
as an example:
Configuration conf = new Configuration(); + conf.set("parquet.strings.signed-min-max.enabled", "true"); Path inputPath = new Path(input); FileStatus inputFileStatus = inputPath.getFileSystem(conf).getFileStatus(inputPath); List<Footer> footers = ParquetFileReader.readFooters(conf, inputFileStatus, false);
I've been unable to set this value in pyspark
(perhaps I'm setting it in the wrong place?)
example dataframe
import random
import string
from pyspark.sql.types import StringType
r = []
for x in range(2000):
r.append(u''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)))
df = spark.createDataFrame(r, StringType())
I've tried a few different ways of setting this option:
df.write.format("parquet").option("parquet.strings.signed-min-max.enabled", "true").save("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", "true").parquet("s3a://test.bucket/option")
df.write.option("parquet.strings.signed-min-max.enabled", True).parquet("s3a://test.bucket/option")
But all of the saved parquet files are missing the ST/STATS for the BINARY column. Here is an example output of the metadata from one of the parquet files:
creator: parquet-mr version 1.8.3 (build aef7230e114214b7cc962a8f3fc5aeed6ce80828)
extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"value","type":"string","nullable":true,"metadata":{}}]}
file schema: spark_schema
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
value: OPTIONAL BINARY O:UTF8 R:0 D:1
row group 1: RC:33 TS:515
---------------------------------------------------------------------------------------------------
Also, based on this email chain https://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3C9DEF4C39-DFC2-411B-8987-5B9C33842974@videoamp.com%3E and question: Specify Parquet properties pyspark
I tried sneaking the config in through the pyspark private API:
spark.sparkContext._jsc.hadoopConfiguration().setBoolean("parquet.strings.signed-min-max.enabled", True)
So I am still unable to set this conf parquet.strings.signed-min-max.enabled
in parquet-mr
(or it is set, but something else has gone wrong)
- Is it possible to configure
parquet-mr
from pyspark - Does pyspark 2.3.x support BINARY column stats?
- How do I take advantage of the PARQUET-686 feature to add
min/max
metadata for string columns in a parquet file?