
I have an AWS Glue Python job which joins two Aurora tables and writes/sinks the output to an S3 bucket in JSON format. The job is working fine as expected. By default the output file is written to the S3 bucket with a name in the pattern "run-123456789-part-r-00000". [Behind the scenes it is running PySpark code on a Hadoop cluster, so the file name is Hadoop-like.]

Now, my question is: how do I write the file with a specific name like "Customer_Transaction.json" instead of "run-***-part****"?

I tried converting to a DataFrame and then writing it as JSON, like below, but it did not work:

customerDF.repartition(1).write.mode("overwrite").json("s3://bucket/aws-glue/Customer_Transaction.json")

Kiran

2 Answers


Glue under the hood is a Spark job, and this is the way Spark saves files. The workaround: after you have saved the DataFrame, rename the resulting file.

A similar question in the scope of Spark jobs: Specifying the filename when saving a DataFrame as a CSV
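
In PySpark, the same workaround might look roughly like the sketch below: write to a temporary prefix, then rename the single part file through the Hadoop FileSystem API. The bucket and key names are placeholders, it assumes the SparkContext is available as sc, and it has not been tested in Glue.

# Sketch: write to a temporary prefix, then rename the single part file.
# The "s3://bucket/aws-glue/..." paths are placeholders.
tmp_dir = "s3://bucket/aws-glue/_tmp_customer_transaction"
customerDF.repartition(1).write.mode("overwrite").json(tmp_dir)

Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
conf = sc._jsc.hadoopConfiguration()
fs = Path(tmp_dir).getFileSystem(conf)

# repartition(1) produced exactly one part file; find it and rename it
for status in fs.listStatus(Path(tmp_dir)):
    name = status.getPath().getName()
    if name.startswith("part-"):
        fs.rename(status.getPath(),
                  Path("s3://bucket/aws-glue/Customer_Transaction.json"))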

Natalia
    Thanks Natalia for the workaround. The solution in the above URL uses Scala, though. I am looking for a similar approach in PySpark (in AWS Glue). Do you have any recommendation? – Kiran May 05 '18 at 21:28

I think I got the solution. Here is the code snippet that worked in my local Hadoop/Spark environment; it still needs to be tested in AWS Glue:

# Access the Hadoop filesystem classes through the py4j JVM gateway
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem

# Default filesystem from the active Hadoop configuration
fs = FileSystem.get(sc._jsc.hadoopConfiguration())
srcpath = Path("/user/cloudera/IMG_5252.mov")
dstpath = Path("/user/cloudera/IMG_5252_123.mov")
if not fs.exists(srcpath):
    print("Input path does not exist")
else:
    # rename is a FileSystem operation, not a Path method
    fs.rename(srcpath, dstpath)
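
One caveat for Glue: FileSystem.get(hadoopConfiguration()) returns the cluster's default filesystem, not S3. When the source and destination live in S3, resolving the filesystem from the path itself should be safer. A hedged adaptation, with placeholder bucket and key names:

srcpath = Path("s3://bucket/aws-glue/run-123456789-part-r-00000")
dstpath = Path("s3://bucket/aws-glue/Customer_Transaction.json")
# Resolve the filesystem that owns this path (S3), not the default one (HDFS)
fs = srcpath.getFileSystem(sc._jsc.hadoopConfiguration())
if fs.exists(srcpath):
    fs.rename(srcpath, dstpath)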
Kiran