
We have some AWS Glue jobs generating parquet files into one of our S3 buckets. I've read in many places that I should optimize my file sizes so that each file is at least 128 MB - this is one reference.

Can this be done by Glue itself, e.g. with an auto-generated Glue script?
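
For reference, the write step of an auto-generated Glue script looks roughly like the sketch below; the database, table and bucket names are placeholders rather than our real ones:

```
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the source table from the Glue Data Catalog (placeholder names)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

# Write parquet to S3; Glue emits one file per Spark partition, so the
# output file sizes simply follow however the source data is partitioned
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet")

job.commit()
```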

  • What size are the files? Are you trying to make the files bigger to reach 128 MB? In that case consolidate them before conversion. – John Hanley Oct 04 '18 at 16:53
  • I have both cases, bigger and smaller files. I am consolidating them using something like ```partitioned_dataframe1 = datasource1.toDF().repartition(1)```, but this is a manual process (roughly as in the sketch after these comments); I would like Glue to do it by itself – Renato Bibiano Oct 04 '18 at 18:53
  • How many S3 requests do you have right now? Since you are using parquet, this doesn't make sense to me unless you want to go from 1000 requests per day to just a dozen. If you don't have a high request rate, it also means getting close to the function timeout (if you want to use Lambda triggers) – HoofarLotusX Oct 04 '18 at 20:35
  • 1
    Its not possible. But there is a suggestion in the post below. https://stackoverflow.com/questions/39187622/how-do-you-control-the-size-of-the-output-file – Tanveer Uddin Oct 04 '18 at 22:45
  • Thank you for your replies; that post will help me, then – Renato Bibiano Oct 05 '18 at 13:33
  • @RenatoBibiano what direction did you choose eventually? – ArielB Feb 09 '21 at 10:33
  • @ArielB we are not using Glue anymore – Renato Bibiano Feb 09 '21 at 18:59
  • What did you choose then? – ArielB Feb 09 '21 at 20:27
  • @ArielB we created our own job, but today we would probably have chosen DMS – Renato Bibiano Feb 11 '21 at 11:11
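
A rough sketch of the consolidation workaround mentioned in the comments: repartition before the write so the number of output files, and therefore their approximate size, can be steered toward ~128 MB. The input size, bucket path and frame names below are assumptions for illustration, not values from the original job:

```
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Source read as in the question's script sketch (placeholder names)
datasource1 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table")

# Aim for ~128 MB output files: estimate how many partitions that implies.
# total_input_bytes is an assumed figure; in practice it would be measured,
# e.g. from the source table's statistics or an S3 listing.
target_file_size_bytes = 128 * 1024 * 1024
total_input_bytes = 4 * 1024 * 1024 * 1024  # assumed ~4 GB of input
num_files = max(1, total_input_bytes // target_file_size_bytes)

# Repartition via a Spark DataFrame, then convert back to a DynamicFrame
df = datasource1.toDF().repartition(int(num_files))
consolidated = DynamicFrame.fromDF(df, glueContext, "consolidated")

# One parquet file is written per partition, so sizes land near the target
glueContext.write_dynamic_frame.from_options(
    frame=consolidated,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},  # placeholder bucket
    format="parquet")
```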

0 Answers