
I am working on a use case wherein I need to save each line of text in an RDD as a separate file in Google Cloud Storage.

The platform is Databricks with Spark 3.2.x, and the language is Scala.

Can you please point me to a relevant document that can help me do that?

We have methods to save text, but nothing that works at such a fine granularity (one file per line).

Gaël J
Aishwary Shukla

1 Answer


You could control the number of records per file using the maxRecordsPerFile write option:

val df = ...  // your DataFrame of lines
df.write
    .option("maxRecordsPerFile", 1)  // write at most one record per output file
    ...
Islam Elbanna
  • Thanks for the details. One more query: let's say I have a dataframe with 2 columns, one with string data and another with the filename. How can we achieve the same result of one file per string, with the associated file name? Will a foreach iteration over the dataframe be required? Or is there a simpler way to do that? – Aishwary Shukla May 29 '23 at 12:58
  • If we are talking about small data, then I guess this is not a suitable use case for Spark, since iterating over the data means collecting it all to the driver process; in that case you could just collect the data and save it to the specific files using native Scala. – Islam Elbanna May 29 '23 at 13:42
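
For the scenario in the comments, here is a minimal sketch of the collect-and-write-from-the-driver approach suggested above. The column names `content` and `filename`, the gs:// prefix, and the pre-existing `spark` and `df` values are assumptions, and collecting is only reasonable for small data.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.Path

// Bring the (text, target file name) pairs to the driver -- small data only
val rows = df.select("content", "filename").collect()

val hadoopConf = spark.sparkContext.hadoopConfiguration
rows.foreach { row =>
  val text     = row.getString(0)
  val fileName = row.getString(1)
  val target   = new Path(s"gs://<bucket>/<output-prefix>/$fileName")  // placeholder location
  val fs       = target.getFileSystem(hadoopConf)                      // resolves the GCS filesystem
  val out      = fs.create(target, true)                               // overwrite if the file exists
  try out.write(text.getBytes(StandardCharsets.UTF_8))
  finally out.close()
}

Because everything runs on the driver, this avoids Spark's part-file naming and lets each row land in exactly the file named in its `filename` column.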