I have a huge data set partitioned by month. I can write the Parquet files using `df.write.partitionBy("month").parquet(...)`, and reading them back with Spark itself works fine. However, the Parquet files themselves don't contain the partition columns; the partition values are represented only by the folders the files reside in. When reading the Parquet files with an external program (like PolyBase), we cannot tell which month a file belongs to.
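For reference, this is roughly how the files are written (a minimal sketch; `df`, the `month` column, and the paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("monthly-export").getOrCreate()

# The large data set with a "month" column (placeholder path and name)
df = spark.read.parquet("/data/source")

# partitionBy moves the "month" column out of the data files and into
# the directory layout, e.g. /data/out/month=2023-01/part-*.parquet
df.write.partitionBy("month").parquet("/data/out")
```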
Is there any way to force Spark to include the partition columns in the Parquet files themselves? If not, are there any alternatives?
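The only workaround I can think of is duplicating the partition column under a different name before writing, so a copy of the value survives inside each file (a sketch, using the same placeholder names as above; `month_value` is a made-up column name):

```python
from pyspark.sql import functions as F

# Keep "month" as the physical partition, and carry a copy of the value
# ("month_value") inside each parquet file as a regular column so that
# external readers can see it.
(df.withColumn("month_value", F.col("month"))
   .write.partitionBy("month")
   .parquet("/data/out"))
```

This works, but it stores each month value twice (once in the directory path and once in the file), which feels wasteful for a huge data set; hence the question about alternatives.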