I use Dataproc to run a PySpark script that writes a dataframe out as text files to a Google Cloud Storage bucket. When I run the script on a large dataset I end up with a large number of text files in my output folder, but I want only one large file.
I read here (Spark saveAsTextFile() writes to multiple files instead of one) that I can call .repartition(1) before .write() to get one file, but I want the job to stay fast (of course), so I don't want to collapse everything down to a single partition before performing the .write(). My current write looks like this:
df_plain = df.select('id', 'string_field1').write.mode('append').partitionBy('id').text('gs://evatest/output', compression="gzip")
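For reference, this is a minimal sketch of the repartition(1) variant I'm trying to avoid, assuming the same df (with 'id' and 'string_field1' columns) and the same bucket path as above; only the .repartition(1) call is added:

# Sketch of the single-file workaround: repartition(1) forces all rows into
# one partition, so a single task does the writing and each 'id' partition
# directory ends up with one file, but the whole write runs on one executor
# and becomes the bottleneck on big data.
(df.select('id', 'string_field1')
   .repartition(1)
   .write
   .mode('append')
   .partitionBy('id')
   .text('gs://evatest/output', compression="gzip"))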