In this documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-parquet

It mentions: "any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter."

However, how can I find out what those options are? There's no clear mapping between the Glue code and the SparkSQL code.

(Specifically, I want to figure out how to control the size of the resulting parquet files.)
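To make the question concrete, here is roughly the kind of write I mean, a minimal sketch of a Glue Parquet write in PySpark (the toy data, S3 path, and compression value are illustrative placeholders, not taken from the docs):

```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Toy data standing in for a real job's DynamicFrame
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# connection_options carries the sink location; format_options is the
# map that (per the linked page) is handed to the Parquet writer
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
    format_options={"compression": "snappy"},
)
```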

Narfanator
    Unfortunately there is no such option to control size of parquet files. There is a [trick](https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file) using coalesce though. – Yuriy Bondaruk Jun 30 '18 at 14:13
  • Yeah :/ Apparently the closest we can get is to set `repartition(n)` prior to write out, which'll then produce n files (per partition key combo, if you're also using those) – Narfanator Jul 02 '18 at 21:21

1 Answer

SparkSQL options for the various data sources can be looked up in the DataFrameWriter documentation (in the Scala or pyspark docs). The data source for writing parquet seems to take only a `compression` parameter. For SparkSQL options when reading data, have a look at the DataFrameReader class.
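For example, in pyspark that option would be passed like this (a sketch; the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# "compression" is the option the Parquet writer accepts;
# values include "snappy", "gzip", and "none"
df.write.option("compression", "gzip").parquet("s3://my-bucket/output/")
```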

To control the size of your output files you should play with parallelism, like @Yuriy Bondaruk commented, using for example the `coalesce` function, as sketched below.
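A sketch of that trick inside a Glue job (`dyf` and `glue_context` are assumed to be set up as in the snippet in the question; the partition count of 4 is an arbitrary example value):

```python
from awsglue.dynamicframe import DynamicFrame

# Collapse the data down to a fixed number of partitions before
# writing; each partition becomes roughly one output file, so fewer
# partitions means fewer, larger Parquet files.
coalesced = DynamicFrame.fromDF(dyf.toDF().coalesce(4), glue_context, "coalesced")

glue_context.write_dynamic_frame.from_options(
    frame=coalesced,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)
```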

botchniaque
  • Still not finding it clear. For instance, there's a write mode (append, overwrite, ignore, etc) option for certain operations. Can that be passed in? How? – Narfanator Jul 05 '18 at 21:14
  • I don't think you can set the `saveMode` via option. _Options_ are all custom parameters for [DataSources](https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-datasource-api.html) – botchniaque Jul 05 '18 at 21:49
  • Savemode: `df.write.mode(SaveMode.Append)` – Narfanator Oct 25 '18 at 21:35
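For the record, a pyspark version of that last comment, as a sketch (reusing `df` from the earlier snippet; the save mode is set on the writer itself rather than passed as a data source option):

```python
# Append to existing output instead of failing when it exists;
# mode() is a method on DataFrameWriter, not an option key
df.write.mode("append").parquet("s3://my-bucket/output/")
```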