I'm working with a Spark DataFrame (in Scala), and what I'd like to do is group by a column and turn the different groups into a sequence of DataFrames.

So it would look something like

df.groupBy("col").toSeq  -> Seq[DataFrame]

Even better would be to turn it into something keyed, like a map:

df.groupBy("col").toMap  -> Map[key, DataFrame]

This seems like an obvious thing to do, but I can't figure out how to make it work.

1 Answer

This is what you could do. Here is a simple example:

import spark.implicits._
val data = spark.sparkContext.parallelize(Seq(
  (29,"City 2", 72),
  (28,"City 3", 48),
  (28,"City 2", 19),
  (27,"City 2", 16),
  (28,"City 1", 84),
  (28,"City 4", 72),
  (29,"City 4", 39),
  (27,"City 3", 42),
  (26,"City 3", 68),
  (27,"City 1", 89),
  (27,"City 4", 104),
  (26,"City 2", 19),
  (29,"City 3", 27)
)).toDF("week", "city", "sale")
// create a DataFrame with dummy data


// collect the distinct city values to the driver
val city = data.select("city").distinct.collect().flatMap(_.toSeq)

// filter the rows for each city
// this returns Array[(Any, DataFrame)] as (city, DataFrame) pairs
val result = city.map(c => c -> data.where($"city" === c))

// print each city's DataFrame
result.foreach { case (cityName, df) =>
  println(s"Dataframe with $cityName")
  df.show()
}

The output looks like this:

Dataframe with City 1

+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 1|  84|
|  27|City 1|  89|
+----+------+----+

Dataframe with City 3

+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 3|  48|
|  27|City 3|  42|
|  26|City 3|  68|
|  29|City 3|  27|
+----+------+----+

Dataframe with City 4

+----+------+----+
|week|  city|sale|
+----+------+----+
|  28|City 4|  72|
|  29|City 4|  39|
|  27|City 4| 104|
+----+------+----+

Dataframe with City 2

+----+------+----+
|week|  city|sale|
+----+------+----+
|  29|City 2|  72|
|  28|City 2|  19|
|  27|City 2|  16|
|  26|City 2|  19|
+----+------+----+
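
Since the question also asked for a keyed structure, the (key, DataFrame) pairs convert directly to a Map. A minimal sketch, assuming the grouping column is a string so the keys can be typed as String instead of Any:

import org.apache.spark.sql.DataFrame

// collect the distinct keys as typed Strings
val cities: Array[String] = data.select("city").distinct.as[String].collect()

// build a Map from city name to its filtered DataFrame
val byCity: Map[String, DataFrame] =
  cities.map(c => c -> data.where($"city" === c)).toMap

byCity("City 1").show()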

You can also use partitionBy to group the data and write it out:

dataframe.write.partitionBy("col").parquet("outputPath")

This creates a separate output directory for each value of "col".
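
Once written this way, a single group can be read back cheaply, because a filter on the partition column prunes the scan to that group's directory. A minimal sketch, using a hypothetical output path:

// write one directory per distinct city value (path is hypothetical)
data.write.partitionBy("city").mode("overwrite").parquet("/tmp/sales_by_city")

// the filter on the partition column only scans City 1's directory
val cityOne = spark.read.parquet("/tmp/sales_by_city").where($"city" === "City 1")
cityOne.show()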

Hope this helps!
