
So, as the title suggests: I think I remember there was some sort of option for Glue jobs to generate a single CSV output file instead of multiple ones. This was specific to some Glue configuration and independent of any Apache Spark related functions. What are the setting changes in the PySpark file required to achieve this? Thanks a lot in advance

Wassily

1 Answer


You can specify the output format using the options of the DynamicFrameWriter class; example snippet below:

glue_context.write_dynamic_frame.from_options(
    frame=dyf,  # the DynamicFrame to write
    connection_type="s3",
    connection_options={"path": "$outpath", "partitionKeys": ["type"]},
    format="csv")

You can find the list of supported formats in the AWS Glue documentation on format options for ETL inputs and outputs.

PS: The code snippet is based on the Python API, but if you are using the Scala API it should be similar.

Somasundaram Sekar
  • Thanks! Quick question, though: I already specified csv as my output format; my issue is just that I want to avoid multiple CSV partition output files. Do I do this by changing an attribute for the partition_keys property? – Wassily Apr 09 '19 at 14:54
  • partition_keys are used to specify whether you want to repartition the data while saving. If you want to avoid writing multiple files, one way I can think of is to convert the DynamicFrame into a Spark SQL DataFrame, call coalesce(1), and then convert it back to a DynamicFrame (there may be an API on DynamicFrame itself; please check), as sketched after these comments. But you need to be absolutely sure that the resulting DataFrame will fit in the memory of a single executor (account for the overhead memory as well). – Somasundaram Sekar Apr 09 '19 at 15:06
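
For reference, a minimal sketch of the coalesce(1) approach described in the last comment. The input DataFrame and the S3 output path are hypothetical placeholders; in a real job, `dyf` would be the DynamicFrame produced earlier.

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Hypothetical input; in a real job `dyf` comes from earlier in the script.
spark = glue_context.spark_session
df_in = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "type"])
dyf = DynamicFrame.fromDF(df_in, glue_context, "dyf")

df = dyf.toDF()             # DynamicFrame -> Spark DataFrame
single_df = df.coalesce(1)  # collapse to one partition => one output file

# Convert back so the Glue writer can be used for the final write.
single_dyf = DynamicFrame.fromDF(single_df, glue_context, "single_dyf")

glue_context.write_dynamic_frame.from_options(
    frame=single_dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output/"},  # placeholder path
    format="csv")

Note that coalesce(1) forces all the data through a single task, which is why the memory caveat in the comment above matters.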