I have an RDD[(String, Iterable[WikipediaArticle])] which looks something like this:

(Groovy, CompactBuffer(WikipediaArticle({has a String title}, {has some text corresponding to that title}), WikipediaArticle({has a String title}, {has some text corresponding to that title})))

The curly brackets above are only there to distinguish the title from the text and keep things readable.

Groovy is the String key.
WikipediaArticle is a class with two attributes, title and text.

I need an output of type List[(String, Int)], where:

String is the first element of each entry, unique per line ("Groovy" in the example above)
Int is the count of WikipediaArticles inside the CompactBuffer for that String

I have tried to make things as clear as possible; if you think the question can be improved or you have any doubts, please feel free to ask.
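
For concreteness, here is a minimal sketch of the types described above (the name grouped is hypothetical, and WikipediaArticle is modelled on the description):

case class WikipediaArticle(title: String, text: String)

// The input: one entry per keyword, with all matching articles grouped together
val grouped: org.apache.spark.rdd.RDD[(String, Iterable[WikipediaArticle])] = ???

// The desired output: each keyword paired with its article count,
// e.g. List(("Groovy", 2))
val result: List[(String, Int)] = ???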

  • You're asking for a solution to a coursera course assignment, which is against the honor code. What have you tried so far? – Zoltán Jun 15 '17 at 20:42
  • I know. I am stuck and I can't find any help in the discussion section either. I have tried to iterate over the CompactBuffer using map, but with no success. I just need a little hint; I know I am doing something silly. – Kireet Bhat Jun 15 '17 at 20:49

1 Answer


If you treat each element of the RDD as a (k, v) pair, with the keyword as k and the CompactBuffer as v, one approach is to use map with a pattern-matching partial function, as in the following:

case class WikipediaArticle(title: String, text: String)

// Sample data (sc is the SparkContext, e.g. in spark-shell)
val rdd = sc.parallelize(Seq(
  ("Groovy", Iterable(WikipediaArticle("title1", "text1"), WikipediaArticle("title2", "text2"))),
  ("nifty", Iterable(WikipediaArticle("title2", "text2"), WikipediaArticle("title3", "text3"))),
  ("Funny", Iterable(WikipediaArticle("title1", "text1"), WikipediaArticle("title3", "text3"), WikipediaArticle("title4", "text4")))
))

// Map each (keyword, articles) pair to (keyword, article count)
rdd.map{ case (k, v) => (k, v.size) }
// res1: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[1] at map at <console>:29

res1.collect.toList
// res2: List[(String, Int)] = List((Groovy,2), (nifty,2), (Funny,3))
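
As a side note, since this is a pair RDD, mapValues is an equivalent alternative that transforms only the values and avoids destructuring the tuple:

// mapValues keeps each key and applies the function to the value only
rdd.mapValues(_.size).collect().toList
// List[(String, Int)] = List((Groovy,2), (nifty,2), (Funny,3))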
  • Thanks Leo. Unfortunately, in my case this solution does not work, but it does when I use `index.map(k => (k._1,k._2.iterator.size)).collect().toList` instead. I am not sure why it is unable to treat the RDD as a case (k, v) pair as you suggested above; I can't say clearly what the error is. – Kireet Bhat Jun 16 '17 at 00:23
  • @Kireet Bhat, did you make sure `{ }`, not `( )`, were used for enclosing the `case` partial function? – Leo C Jun 16 '17 at 00:53
  • It works now! That is something totally new; I did not know there was a difference between curly and round brackets. Do you know why it didn't work with ( ) earlier but does with { }? – Kireet Bhat Jun 16 '17 at 01:04
  • Scala requires function/partial function literals (including `case`) to be enclosed by curly braces. Here's a [link](https://stackoverflow.com/a/4387118/6316508) re: curly braces vs parentheses. – Leo C Jun 16 '17 at 01:24
  • I should have done that earlier. Sorry. Thanks for your help. :) – Kireet Bhat Jun 16 '17 at 01:27
  • Glad that it helps. Also good to see you came up with an alternative solution. – Leo C Jun 16 '17 at 01:35
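
To make the braces-vs-parentheses point from the comments concrete, here is a minimal sketch using the same rdd as above. A bare case clause defines a pattern-matching anonymous function (a partial-function literal), which Scala only accepts inside curly braces:

// Compiles: { } delimit a block, which may consist of `case` clauses
rdd.map { case (k, v) => (k, v.size) }

// Does not compile: ( ) expect a plain expression, and a bare
// `case` clause is not one
// rdd.map( case (k, v) => (k, v.size) )

// Tuple accessors work as an ordinary lambda with plain parentheses,
// which is why the alternative from the comments succeeds
rdd.map(p => (p._1, p._2.size))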