
I use Dataproc to run a PySpark script that writes a dataframe to text files in a Google Cloud Storage bucket. When I run the script with big data, I end up with a large number of text files in my output folder, but I want only one large file.

I read here (Spark saveAsTextFile() writes to multiple files instead of one) that I can use .repartition(1) before .write() to get one file, but I want it to run fast (of course), so I don't want to go back to a single partition before performing the .write().

df_plain = df.select('id', 'string_field1').write.mode('append').partitionBy('id').text('gs://evatest/output', compression="gzip")
eml
1 Answer


Don't think of GCS as a filesystem. The content of a GCS bucket is a set of immutable blobs (files). Once written, they can't be changed. My recommendation is to let your job write all the files independently and aggregate them at the end. There are a number of ways to achieve this.

The easiest way to achieve this is through the gsutil compose command.
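As a rough sketch, assuming the output location from the question and that there are at most 32 part files (the per-request limit for a compose operation), the concatenation could look something like this (the object names are illustrative):

gsutil compose gs://evatest/output/part-00000.txt.gz gs://evatest/output/part-00001.txt.gz gs://evatest/output/combined.txt.gz

The last argument is the destination object; the preceding arguments are the source objects to concatenate, in order.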


Kolban
  • Thank you, that's the conclusion I came to. Is there any way to work this into my PySpark job? Maybe using subprocess; is that reliable? Is there any advantage to running it from Cloud Functions? – eml Nov 07 '19 at 16:41
  • The combination of the multiple files into a single file is performed server-side within GCS, so where you submit the combine request from is not an issue. Here is a link to the API ... https://googleapis.dev/python/storage/latest/blobs.html or https://googleapis.dev/nodejs/storage/latest/Bucket.html#combine You could invoke this from wherever you want. If your application knows when all the files have been written, that may be an ideal place to combine. – Kolban Nov 07 '19 at 17:11
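A minimal sketch of the approach described in the comment above, using the google-cloud-storage Python client; the bucket name and output prefix are taken from the question, and a single compose call accepts at most 32 source objects, so larger outputs would need to be combined in batches:

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("evatest")  # bucket name taken from the question

# Collect the part files Spark wrote under the output prefix.
parts = [b for b in bucket.list_blobs(prefix="output/") if "part-" in b.name]

# Server-side concatenation into a single object (at most 32 sources per call).
combined = bucket.blob("output/combined.txt.gz")
combined.compose(parts)

This could run from the driver at the end of the Spark job, from a Cloud Function, or anywhere else with access to the bucket, since no object data is downloaded.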