Here is one way to do it, with a simple example:
import spark.implicits._
// create a DataFrame with dummy data
val data = spark.sparkContext.parallelize(Seq(
(29,"City 2", 72),
(28,"City 3", 48),
(28,"City 2", 19),
(27,"City 2", 16),
(28,"City 1", 84),
(28,"City 4", 72),
(29,"City 4", 39),
(27,"City 3", 42),
(26,"City 3", 68),
(27,"City 1", 89),
(27,"City 4", 104),
(26,"City 2", 19),
(29,"City 3", 27)
)).toDF("week", "city", "sale")
// get the list of distinct cities
val city = data.select("city").distinct.collect().flatMap(_.toSeq)
// build one filtered DataFrame per city
// this returns an Array[(Any, DataFrame)] of (city, DataFrame) pairs
val result = city.map(c => c -> data.where($"city" === c))
// print each city's DataFrame
result.foreach { a =>
  println(s"Dataframe with ${a._1}")
  a._2.show()
}
The output looks like this:
Dataframe with City 1
+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 1|  84|
|  27|City 1|  89|
+----+------+----+
Dataframe with City 3
+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 3|  48|
|  27|City 3|  42|
|  26|City 3|  68|
|  29|City 3|  27|
+----+------+----+
Dataframe with City 4
+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 4|  72|
|  29|City 4|  39|
|  27|City 4| 104|
+----+------+----+
Dataframe with City 2
+----+------+----+
|week|  city|sale|
+----+------+----+
|  29|City 2|  72|
|  28|City 2|  19|
|  27|City 2|  16|
|  26|City 2|  19|
+----+------+----+
You can also use partitionBy to group the data when writing it out:
dataframe.write.partitionBy("col").save("outputpath")
This creates one output directory per distinct value of "col". (If you want a table instead, note that saveAsTable takes a table name rather than a path.)
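For instance, here is a minimal sketch applying this to the data DataFrame from above (the output format and path are placeholders, not part of the original example):
// writes one subdirectory per city, e.g. city=City 1/, city=City 2/, ...
data.write
  .partitionBy("city")
  .mode("overwrite")           // placeholder: overwrite so reruns succeed
  .format("parquet")           // placeholder format
  .save("/tmp/sales_by_city")  // placeholder output path
Each subdirectory then holds only that city's rows, which gives you the same per-group split without collecting the distinct keys to the driver first.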
Hope this helps!