
I am working on a use case wherein I need to save each line of text in an RDD as a separate file in Google Cloud Storage.

The platform is Databricks with Spark 3.2.x, and the language is Scala.

Can you please point me to a relevant document that can help me do that?

We have methods to save text, but nothing that works at such a fine granularity (one file per line).

Gaël J
Aishwary Shukla

1 Answer


You could control the number of records per file using the maxRecordsPerFile write option:

val df = ...  // your DataFrame of lines
df.write
    .option("maxRecordsPerFile", 1)  // write at most one record per output file
    ...
Islam Elbanna
  • Thanks for the details. One more query: let's say I have a dataframe with 2 columns, one with string data and another with the filename. How can we achieve the same result of one file per string, with the associated file name? Will a foreach iteration over the dataframe be required? Or is there a simpler way to do that? – Aishwary Shukla May 29 '23 at 12:58
  • If we are talking about small data, then I guess this is not a suitable use case for Spark, since iterating over the data means collecting it all to the driver process; in that case you could just collect the data and save it to the specific files using native Scala. – Islam Elbanna May 29 '23 at 13:42
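
For the scenario in the comments, here is a minimal sketch of the collect-and-write-from-the-driver approach suggested above. The column names `content` and `filename`, the gs:// prefix, and the pre-existing `spark` and `df` values are assumptions, and collecting is only reasonable for small data.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.Path

// Bring the (text, target file name) pairs to the driver -- small data only
val rows = df.select("content", "filename").collect()

val hadoopConf = spark.sparkContext.hadoopConfiguration
rows.foreach { row =>
  val text     = row.getString(0)
  val fileName = row.getString(1)
  val target   = new Path(s"gs://<bucket>/<output-prefix>/$fileName")  // placeholder location
  val fs       = target.getFileSystem(hadoopConf)                      // resolves the GCS filesystem
  val out      = fs.create(target, true)                               // overwrite if the file exists
  try out.write(text.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}

Because everything runs on the driver, this avoids Spark's part-file naming and lets each row land in exactly the file named in its `filename` column.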