
I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I want to give it a specific name. Most answers on Stack Overflow use filesystem modules, but here this particular CSV file is generated in S3. I also want to name the file while generating it, not rename it after it is generated. Is there any way to do that?

I have tried using `df.save('s3://PATH/filename.csv')`, which actually creates a new directory in S3 named filename.csv and then generates part-*.csv files inside that directory:

`df.repartition(1).write.mode('append').option("header", "true").format('csv').save('s3://PATH')`
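For reference, the commonly suggested workaround is to let Spark write under a temporary prefix and then copy the single part file to the desired key with boto3. That is effectively the rename-after-write I'd like to avoid, but here is a minimal sketch for completeness; the bucket, prefix, and file names are placeholders, and `df` is the DataFrame from above:

```python
import boto3
from datetime import datetime

# Placeholder names; substitute the job's real bucket and prefixes.
bucket = "my-bucket"
tmp_prefix = "reports/tmp/"
final_key = f"reports/report_{datetime.utcnow().strftime('%Y%m%d_%H%M%S')}.csv"

# 1) Have Spark produce a single part-*.csv under a temporary prefix.
(df.repartition(1)
   .write.mode("overwrite")
   .option("header", "true")
   .format("csv")
   .save(f"s3://{bucket}/{tmp_prefix}"))

# 2) Copy that part file to the desired, timestamped key and delete the original.
s3 = boto3.client("s3")
part_key = next(
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)["Contents"]
    if obj["Key"].endswith(".csv")
)
s3.copy_object(Bucket=bucket, CopySource={"Bucket": bucket, "Key": part_key}, Key=final_key)
s3.delete_object(Bucket=bucket, Key=part_key)
```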
– EngineJanwaar
  • What is the use case? Why do you need custom names for the files generated by Spark? – Harsh Bafna May 21 '19 at 10:10
  • Hey Harsh, we're taking a few input files from S3 in Glue, doing some analysis, running a few SQL queries on the DataFrame, and generating a CSV output as an analysis report, all via Glue. The final report doesn't contain any columns from the input files and has entirely new columns. I was able to generate the output, but to automate this process the files should be generated with a custom name and timestamp. – EngineJanwaar May 21 '19 at 10:50
  • The steps should go like this: 1) Read source data from S3 through the Glue catalog. 2) Execute SQL queries using Glue ETL. 3) Write the new DF back to a new location on S3. – Harsh Bafna May 21 '19 at 10:56
  • Now what do you want to do with the newly written CSV files on S3? Perform a read job to put them on some analytical charts? – Harsh Bafna May 21 '19 at 10:57
  • Yes, these files will be consumed by another system which checks against a fixed naming convention – EngineJanwaar May 21 '19 at 11:04
  • It would be easier to create an Athena external table on top of the newly generated data on S3. You will be able to query the data using plain SQL, and it will be much faster as well compared to reading files from S3. – Harsh Bafna May 21 '19 at 11:06
  • Unfortunately those resources are not allocated to us; we're doing what we can with the very limited access that we have. If only there were a way to do this, we would change the name without incurring extra costs for other services. I'm not even allowed to create a temp table so that I can copy data one-to-one to CSV reports in S3. – EngineJanwaar May 27 '19 at 11:42
  • Possible duplicate of [Spark - How to write a single csv file WITHOUT folder?](https://stackoverflow.com/questions/43661660/spark-how-to-write-a-single-csv-file-without-folder) - you'll likely want the pandas-based solution. Spark doesn't write single files on its own otherwise. – bsplosion Jun 17 '19 at 13:38
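Following bsplosion's pointer, a minimal sketch of the pandas route, assuming the report is small enough to collect to the driver; the bucket and key names are placeholders:

```python
import boto3
from io import StringIO

# df is the Spark DataFrame from the question; toPandas() pulls it to the
# driver, so this only suits reports small enough to fit in memory.
buf = StringIO()
df.toPandas().to_csv(buf, index=False, header=True)

# One put_object call creates exactly one S3 object with the exact key we want.
boto3.client("s3").put_object(
    Bucket="my-bucket",                 # placeholder bucket
    Key="reports/analysis_report.csv",  # the fixed naming convention goes here
    Body=buf.getvalue(),
)
```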

0 Answers