
I have a Spark job that reads data from a Hive table.
Ex:

r = spark.sql("select * from table")

and I have to write the result to an HDFS location as 256 MB Parquet files.

I am trying

r.write.parquet("/data_dev/work/experian/test11")

This generates ~30 MB files, but I need it to generate 256 MB files.

I also tried this configuration:

r.write.option("parquet.block.size", 256 * 1024 * 1024) \
    .parquet("/path")

Still, the generated files seem to be ~30 MB.


1 Answer


I don't think there is any direct way to control the output file size in Spark. Please refer to this link:

How do you control the size of the output file?
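
That said, a common workaround is to control the number of output files rather than the file size directly: Spark writes one Parquet file per partition, so repartitioning to an estimated file count gets you close to a target size. The sketch below is only an illustration; the 8 GB input estimate and the output path are placeholders, not values taken from the question:

target_file_size = 256 * 1024 * 1024      # desired size per output file (256 MB)
total_bytes = 8 * 1024 * 1024 * 1024      # placeholder: estimated size of the table's data

# One Parquet file is written per partition, so the partition count
# (roughly) determines the size of each output file.
num_files = max(1, total_bytes // target_file_size)

r = spark.sql("select * from table")
r.repartition(num_files).write.parquet("/data_dev/work/experian/test11")

Note that parquet.block.size controls the row-group size inside each Parquet file, not the size of the file itself, which is why setting it did not change the ~30 MB outputs. Also, because Parquet compresses data, the on-disk size will usually be smaller than the in-memory estimate, so the file count may need adjusting.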
