
I can't find a way to change the file names generated by Glue jobs. They are created with names like 'run-xxxxx', but I want to use a specific name instead. Is this possible? P.S.: I'm using a Python script (not Scala).

Leandro
    This is a duplicate question; please look at: https://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv – jbgorski Oct 05 '18 at 17:12

1 Answer


Spark (and all other tools in the Hadoop ecosystem) uses file names as a means to parallelise reads and writes; a Spark job will produce as many files in a folder as there are partitions in its RDD/DataFrame (often named part-XXX). When pointing Spark at a data source (be it S3, a local FS, or HDFS), you always point to a folder containing all the part-xxx files.
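For example, a minimal PySpark sketch of this behaviour (the S3 paths below are placeholders, not taken from the question):

```python
# Minimal PySpark sketch (S3 paths are placeholders): a write targets
# a folder, and Spark fills it with one part file per partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Produces s3://my-bucket/output/ containing files like
# part-00000-<uuid>.csv, plus a _SUCCESS marker; "output/" is a folder.
df.write.mode("overwrite").csv("s3://my-bucket/output/")

# Reads likewise point at the folder containing the part files.
df2 = spark.read.csv("s3://my-bucket/output/")
```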

I don't know what kind of tool you're using, but if it depends on a file-naming convention, then you'll have to rename your files (using your FS client) after the Spark session has finished (this can be done in the driver's code). Be aware that Spark may (and usually does) produce multiple files; you can overcome that by calling coalesce(1) on your DataFrame/RDD before writing, as in the sketch below.
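A minimal sketch of both steps, assuming the output lands in S3 and the rename is done with boto3 from the driver; the bucket, prefix, and target file name are all made up for illustration:

```python
# Hedged sketch: coalesce to a single part file, then "rename" it in S3
# (copy + delete, since S3 has no rename). All names below are made up.
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# coalesce(1) collapses the output to a single part file.
df.coalesce(1).write.mode("overwrite").csv("s3://my-bucket/output/")

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "output/"

# Locate the single part file Spark produced under the output prefix.
objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
part_key = next(o["Key"] for o in objects if "part-" in o["Key"])

# Copy it to the desired name, then delete the original part file.
s3.copy_object(Bucket=bucket,
               CopySource={"Bucket": bucket, "Key": part_key},
               Key=prefix + "my-report.csv")
s3.delete_object(Bucket=bucket, Key=part_key)
```

Note that coalesce(1) forces all data through a single partition, so it only makes sense for outputs small enough to be written by one executor.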

botchniaque
  • Thanks a lot, I didn't know which tool was generating those files and I was able to modify the script and change the name. – Leandro Oct 08 '18 at 13:17