I believe you are confused about the way Spark behaves; I would recommend reading the official documentation and/or a tutorial first.
Nevertheless, I hope this answers your question.
This code will save a DataFrame as a SINGLE CSV file on a local filesystem. It was tested with Spark 2.4.0 and Scala 2.12.8 on an Ubuntu 18.04 laptop.
import org.apache.spark.sql.SparkSession

val spark =
  SparkSession
    .builder
    .master("local[*]")
    .appName("CSV Writer Test")
    .getOrCreate()

import spark.implicits._

val df =
  Seq(
    ("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
    ("Bob", "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
    ("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
    ("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
    ("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
  ).toDF("NAME", "START_DATE", "END_DATE", "STATUS")

df.printSchema
// root
//  |-- NAME: string (nullable = true)
//  |-- START_DATE: string (nullable = true)
//  |-- END_DATE: string (nullable = true)
//  |-- STATUS: string (nullable = true)

df.coalesce(numPartitions = 1) // Merge everything into a single partition, so only one file is written.
  .write
  .option(key = "header", value = "true")
  .option(key = "sep", value = ",")
  .option(key = "encoding", value = "UTF-8")
  .option(key = "compression", value = "none")
  .mode(saveMode = "OVERWRITE")
  .csv(path = "file:///home/balmungsan/dailyReport/") // Change the path. Note there are 3 slashes: the first two belong to the file:// protocol, the third one is the root folder.

spark.stop()
Now, let's check the saved file.
balmungsan@BalmungSan:dailyReport $ pwd
/home/balmungsan/dailyReport
balmungsan@BalmungSan:dailyReport $ ls
part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv _SUCCESS
balmungsan@BalmungSan:dailyReport $ cat part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv
NAME,START_DATE,END_DATE,STATUS
Alex,2018-01-01 00:00:00,2018-02-01 00:00:00,OUT
Bob,2018-02-01 00:00:00,2018-02-05 00:00:00,IN
Mark,2018-02-01 00:00:00,2018-03-01 00:00:00,IN
Mark,2018-05-01 00:00:00,2018-08-01 00:00:00,OUT
Meggy,2018-02-01 00:00:00,2018-02-01 00:00:00,OUT
The _SUCCESS file exists to signal that the write succeeded.
Important notes:
- You need to specify the file:// protocol to save to a local filesystem instead of to HDFS.
- The path specifies the name of the folder where the partitions of the file will be saved, not the name of the file itself; inside that folder there will be one file per partition. If you want to read such a file again with Spark, you only need to specify the folder and Spark will understand the partition files; if not, I would recommend renaming the file afterwards - as far as I know, there is no way to control the name from Spark. (See the first two sketches after this list.)
- Since coalesce(1) moves all the data into a single partition, if the df is too big to fit in the memory of just one node, the job will fail.
- If you run this in a distributed way (e.g. with master yarn), then the file will not be saved on the master node, but on one of the worker nodes. If you really need it to be on the master node, then you can collect the data and write it with plain Scala as Dmitry suggested (see the last sketch after this list).
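For the read-back note in the second point, a minimal sketch, assuming the same session and path as above (the header option mirrors the one used when writing):

// Point Spark at the folder, not at a part file; it will pick up all partition files.
val dfAgain =
  spark.read
    .option("header", "true")
    .csv("file:///home/balmungsan/dailyReport/")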
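For the renaming note, a minimal sketch using the Hadoop FileSystem API that Spark already bundles. The target name report.csv is just an example, and this assumes the write above produced exactly one part file:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("file:///home/balmungsan/dailyReport/")
val fs = FileSystem.get(dir.toUri, new Configuration())

// Find the single part file inside the output folder.
val partFile =
  fs.listStatus(dir)
    .map(_.getPath)
    .find(_.getName.startsWith("part-"))
    .get

// Move it out of the folder under a stable name.
fs.rename(partFile, new Path("file:///home/balmungsan/report.csv"))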
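And for the last point, a minimal sketch of the collect approach in plain Scala. Note that it must run before spark.stop(), that it pulls all the data into driver memory, and that it does no CSV quoting or escaping:

import java.io.PrintWriter

// Collect the (small!) DataFrame to the driver and write it by hand.
val header = df.columns.mkString(",")
val rows = df.collect().map(_.toSeq.mkString(","))

val writer = new PrintWriter("/home/balmungsan/dailyReport.csv")
try {
  writer.println(header)
  rows.foreach(row => writer.println(row))
} finally {
  writer.close()
}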