How to write to a single file from Spark RDD map and reduce operations

Question

I am trying to write to a text file after applying map and reduce operations. The code below creates 8 files, but I need only one file.

df3.rdd.map(_.toSeq.map(_+"").reduce(_+" "+_)).saveAsTextFile("/home/ram/Desktop/test4")

Please suggest how to write the content to a single file.

use .coalesce(1) before save – chlebek Oct 25 '19 at 08:03

score 1 · Answer 1 · answered Oct 25 '19 at 22:30

The best option is "coalesce". The coalesce method reduces the number of partitions in a DataFrame, and saveAsTextFile writes one part file per partition, so a single partition gives a single file.

Here is the code for your question:

df3.coalesce(1).rdd.map(_.toSeq.map(_+"").reduce(_+" "+_)).saveAsTextFile("/home/ram/Desktop/test4")

coalesce performs well here because it merges existing partitions without a full shuffle, which avoids unnecessary data movement. Please check the link below.

Spark - repartition() vs coalesce()
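
To make the difference concrete, here is a minimal sketch, assuming Spark is on the classpath and run in local mode. The object name and the sample DataFrame are hypothetical stand-ins for df3. Both calls end up with one partition; the printed plans should show a shuffle (an Exchange step) only for repartition.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceVsRepartition {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use its own master.
    val spark = SparkSession.builder()
      .appName("coalesce-vs-repartition")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for df3, spread over several partitions.
    val df = Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"))
      .toDF("id", "value")
      .repartition(4)

    // Partition counts before and after each operation.
    println(s"original:       ${df.rdd.getNumPartitions}")
    println(s"coalesce(1):    ${df.coalesce(1).rdd.getNumPartitions}")
    println(s"repartition(1): ${df.repartition(1).rdd.getNumPartitions}")

    // Compare the physical plans: coalesce merges partitions in place,
    // repartition redistributes the rows through a shuffle.
    df.coalesce(1).explain()
    df.repartition(1).explain()

    spark.stop()
  }
}
```

One caveat from the Spark documentation: a drastic coalesce such as coalesce(1) can cause the upstream computation to run in a single task as well, so for an expensive pipeline repartition(1) may be the better trade-off despite the shuffle.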

score 0 · Answer 2 · answered Oct 25 '19 at 08:04

Multiple files are created because each partition is saved as its own part file. If you need a single output file inside the output folder, you can repartition or coalesce the data to one partition before saving.

df3.repartition(1).rdd.map(_.toSeq.map(_+"").reduce(_+" "+_)).saveAsTextFile("/home/ram/Desktop/test4")
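
A self-contained sketch of the same idea, assuming Spark in local mode. The sample data, the object name and the temporary output path are hypothetical. It also uses Row.mkString(" "), which should produce the same space-separated line as the map/reduce expression above. Note that the result is still a directory; the single file inside it is the part file.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object SingleFileWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("single-file-write")
      .master("local[4]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for df3, deliberately split into 3 partitions.
    val df3 = Seq((1, "a"), (2, "b"), (3, "c"))
      .toDF("id", "value")
      .repartition(3)

    // saveAsTextFile fails if the target directory already exists,
    // so point it at a fresh path under a temporary directory.
    val out = Files.createTempDirectory("spark-out").resolve("test4").toString

    // One partition in, one part file out.
    df3.coalesce(1)
      .rdd
      .map(_.mkString(" ")) // join the columns of each Row with a space
      .saveAsTextFile(out)

    // List the data files Spark wrote (ignores _SUCCESS and hidden .crc files).
    val partFiles = new File(out).listFiles()
      .map(_.getName)
      .filter(_.startsWith("part-"))
    partFiles.foreach(println)

    spark.stop()
  }
}
```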
