Saving a PySpark DataFrame as a .csv in a specific S3 bucket location

Asked Jul 17 '23 at 10:45

Active Jul 17 '23 at 10:45

Viewed 42 times

I'm using this chunk of code to save my DataFrame to a specific S3 bucket location:

df.coalesce(1).write\
        .format("csv")\
        .mode("append")\
        .save(f"s3://{bucket_output}/{dirname}/{filename}", header=True, nullValue = '\u0000', emptyValue = '\u0000')

I couldn't find any information on the web about changing the location and the name of such a .csv file using Python from a Glue job. Currently, the CSV is not saved as a file named filename. Instead, Spark creates a directory named filename, and the CSV inside it is named part-(some_numbers).csv.

How can I get around this? Is there a move operation on the S3 bucket, or something similar, that I could use?

Dawid_K

Does this answer your question? [Specifying the filename when saving a DataFrame as a CSV](https://stackoverflow.com/questions/41990086/specifying-the-filename-when-saving-a-dataframe-as-a-csv) – boyangeor Jul 18 '23 at 05:05
The answer you linked is based on Scala, not Python. – Dawid_K Jul 18 '23 at 08:01
The point is that you cannot set the file name via Spark; the file has to be renamed after it is written. How to rename it depends on the underlying storage system. Starting point for S3: [link](https://stackoverflow.com/questions/32501995/boto3-s3-renaming-an-object-using-copy-object) – boyangeor Jul 18 '23 at 09:10
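
Following up on the last comment, here is a minimal sketch of what the rename step could look like in Python. S3 has no rename operation, so the usual approach is to copy the object to the new key and then delete the original. The function name `move_single_part_file` and its parameters are made up for illustration; `s3_client` is assumed to be a boto3 S3 client (for example `boto3.client("s3")`), and the sketch assumes the output directory holds exactly one part file, which may not be true when writing with `mode("append")` more than once.

```python
def move_single_part_file(s3_client, bucket, prefix, target_key):
    """Copy the single part-*.csv under `prefix` to `target_key`, then delete it.

    `s3_client` is assumed to be a boto3 S3 client; all names are illustrative.
    """
    # Spark wrote a "directory", so list the keys under it.
    # Note: only the first page of results (up to 1000 keys) is inspected here.
    response = s3_client.list_objects_v2(
        Bucket=bucket, Prefix=prefix.rstrip("/") + "/"
    )
    keys = [obj["Key"] for obj in response.get("Contents", [])]

    # Keep only the data file(s), skipping markers such as _SUCCESS.
    part_keys = [
        key for key in keys
        if key.rsplit("/", 1)[-1].startswith("part-") and key.endswith(".csv")
    ]
    if len(part_keys) != 1:
        raise ValueError(
            f"expected exactly one part file, found {len(part_keys)}"
        )

    source_key = part_keys[0]
    # "Rename" = copy to the new key, then delete the original object.
    s3_client.copy_object(
        Bucket=bucket,
        Key=target_key,
        CopySource={"Bucket": bucket, "Key": source_key},
    )
    s3_client.delete_object(Bucket=bucket, Key=source_key)
    return source_key
```

In a Glue job this would presumably be called right after the `save(...)` call, with `prefix` set to `f"{dirname}/{filename}"` and `target_key` set to the final object key you want. Any leftover marker objects under the old prefix (such as `_SUCCESS`) are not removed by this sketch, and a single `copy_object` call only handles objects up to 5 GB.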

0 Answers