I've tried working this out and think "flatten" might be part of my solution but I just can't work it out.

Imagine:

case class Thing (value1: Int, value2: Int)
case class Container (string1: String, listOfThings: List[Thing], string2: String)

So my list:

List[Container]

could be any size but for now we'll just have 3.

Inside each Container there is a list

listOfThings: List[Thing]

that could also hold any number of Things; for now we'll also just have 3.

So what I want to get is something like

val fullListOfThings: List[Thing] = List(Thing(1,1), Thing(1,2), Thing(1,3),
    Thing(2,1), Thing(2,2), Thing(2,3), Thing(3,1), Thing(3,2), Thing(3,3))

The first value in each Thing is its Container number and the second value is the Thing's number within that Container.

I hope all this makes sense.

To make it more complicated for me, my list of Container is not actually a list but rather an RDD,

rddOfContainers: RDD[Container]

and what I need at the end is an RDD of Things

fullRddOfThings: RDD[Thing]

In the Java that I am more used to, this would be pretty straightforward, but Scala is different. I'm pretty new to Scala and am having to learn this on the fly, so any full explanation would be very welcome.

I want to avoid bringing in too many external libraries if I can. In the meantime I'll keep reading. Thanks

Roy Wood

2 Answers

With an RDD, as with any standard Scala collection, you can use flatMap for this kind of operation:

val containers = sc.parallelize(Seq(
  Container("",List(Thing(1,2), Thing(2,3)),""), 
  Container("", Nil,""), 
  Container("",List(Thing(3,4)),"")))
//containers: org.apache.spark.rdd.RDD[Container]
val things = containers flatMap (_.listOfThings)
//things: org.apache.spark.rdd.RDD[Thing]
things.collect()
//res2: Array[Thing] = Array(Thing(1,2), Thing(2,3), Thing(3,4))
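
Since the question mentions "flatten": flatMap is the same as map followed by flatten. The sketch below shows this on a plain List so it runs without Spark; the object name FlattenSketch and the string values are made up for illustration, and the data reproduces the 3 x 3 numbering from the question. RDD.flatMap takes a function of the same shape, which is why the Spark snippet above works.

```scala
object FlattenSketch {
  case class Thing(value1: Int, value2: Int)
  case class Container(string1: String, listOfThings: List[Thing], string2: String)

  // Three containers holding three Things each, numbered as in the question:
  // Thing(containerNumber, thingNumber). The strings are placeholders.
  val containers: List[Container] =
    (1 to 3).toList.map { c =>
      Container(s"container$c", (1 to 3).toList.map(t => Thing(c, t)), "")
    }

  // flatMap turns each Container into its list and concatenates the lists.
  val viaFlatMap: List[Thing] = containers.flatMap(_.listOfThings)

  // Equivalent two-step form: map gives a List[List[Thing]], flatten joins it.
  val viaFlatten: List[Thing] = containers.map(_.listOfThings).flatten
}
```

Both viaFlatMap and viaFlatten hold the nine Things in container order, starting with Thing(1,1) and ending with Thing(3,3).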
Odomontois
  • Wow, such a succinct solution! For the sake of my learning, can you tell me what the underscore represents here? Does it refer to a row (which is an entry of type Container) in the containers RDD? Is there a more explicit way I can write this rather than use the underscore, for the sake of readability while I'm still learning? – Roy Wood Jul 02 '15 at 12:38
  • I worked it out. val things = containers.flatMap(rowContainer => rowContainer.listOfThings) Thank You so much! – Roy Wood Jul 02 '15 at 13:02
  • @RoyWood please refer to http://stackoverflow.com/questions/8000903/what-are-all-the-uses-of-an-underscore-in-scala – Odomontois Jul 02 '15 at 13:49
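
To make the underscore discussion in these comments concrete, here is a small sketch of equivalent ways to write the function passed to flatMap. It uses a plain List rather than an RDD so it runs without Spark, and the object and value names (UnderscoreSketch, withUnderscore, and so on) are invented for the example.

```scala
object UnderscoreSketch {
  case class Thing(value1: Int, value2: Int)
  case class Container(string1: String, listOfThings: List[Thing], string2: String)

  // Same sample data as the answer, held in a List instead of an RDD.
  val containers: List[Container] = List(
    Container("", List(Thing(1, 2), Thing(2, 3)), ""),
    Container("", Nil, ""),
    Container("", List(Thing(3, 4)), ""))

  // Placeholder syntax: _ stands for the single argument, one Container.
  val withUnderscore: List[Thing] = containers.flatMap(_.listOfThings)

  // The same function with a named parameter.
  val withNamedParam: List[Thing] =
    containers.flatMap(container => container.listOfThings)

  // The same function with the parameter type spelled out.
  val withTypedParam: List[Thing] =
    containers.flatMap((container: Container) => container.listOfThings)

  // A for-comprehension with two generators expresses the same flattening.
  val withFor: List[Thing] =
    for {
      container <- containers
      thing <- container.listOfThings
    } yield thing
}
```

All four values contain Thing(1,2), Thing(2,3), Thing(3,4); the empty container contributes nothing.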
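// A single flatMap already yields an RDD[Thing]: each container's list is flattened into the result.
// A second flatMap(y => y) does not compile (Thing is not a collection), and sc.parallelize is not needed.
val rddOfThings = rddOfContainers.flatMap(x => x.listOfThings)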
Lukas Eichler
  • Templar, I'm not sure what y is supposed to represent. Also intelliJ doesn't seem to like the 2nd flatMap, are you sure this is right? – Roy Wood Jul 02 '15 at 12:54
  • Probably a naming error: the variable holding the list of things and the result variable had the same name. I updated my code. – Lukas Eichler Jul 02 '15 at 15:03