
My data:

aaaa|1000
bbb|1000
ccc|1000
aaaa|1000
aaaa|2000
aaaa|3000
aaaa|2000
aaaa|1000
bbb|2000
bbb|2000
ccc|1000
ccc|1000
ccc|2000
ccc|3000
ccc|4000

I want to count the number of occurrences of each numeric value for each text label:

aaaa||1000||3||2000||2||3000||1
bbb||2000||2||1000||1
ccc||1000||3||4000||1||2000||1||3000||1

This is my code:

val UserShopRowData = inputData.map( s => (s.replace("|", " ").split(" "))).map( s => (s(0), s(1)))
val u1 = UserShopRowData.map(s=> (s, 1)).reduceByKey(_+_)
val u2 = u1.map(s => (s._1._1, s._1._2, s._2 ))
val u3 = u2.toLocalIterator.toList.sortBy(s => (s._1, s._3 )).reverse

and this is the result I'm getting:

(ccc,1000,3)
(ccc,4000,1)
(ccc,2000,1)
(ccc,3000,1)
(bbb,2000,2)
(bbb,1000,1)
(aaaa,1000,3)
(aaaa,2000,2)
(aaaa,3000,1)

Please give me a solution or some advice.

Tzach Zohar
조상진

2 Answers

0
input
.map(r=>r.split("\\|"))          // do basic word count on input data first
.map(r=> ((r(0), r(1)),1))                        
.reduceByKey(_ + _)
.map(r=>(r._1._1,(r._1._2 + "||" + r._2))) // split key and aggregate again
.reduceByKey((a,b)=> a+"||" + b)
.map(r=>r._1 + "||" + r._2)
banjara
  • Hi, thanks very much. I learned from your answer: I was using r.split("|") before, instead of r.split("\\|"). I would also like the values to be sorted. I look forward to your answer. Have a nice day. – 조상진 Aug 26 '16 at 02:34
  • | is a metacharacter in regex, so you need to escape it. See http://stackoverflow.com/questions/21524642/splitting-string-with-pipe-character – banjara Aug 26 '16 at 05:34
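
To illustrate the point about the pipe being a regex metacharacter, here is a minimal sketch in plain Scala (no Spark needed). The object name PipeSplitDemo is made up for this example. It compares the unescaped String pattern, the escaped pattern, and the Char overload that Scala adds to strings.

```scala
object PipeSplitDemo {
  val line = "aaaa|1000"

  // Unescaped "|" is a regex alternation between two empty patterns, so it
  // matches between characters and the line is broken into tiny pieces.
  val unescaped: List[String] = line.split("|").toList

  // Escaping the pipe makes it a literal delimiter.
  val escaped: List[String] = line.split("\\|").toList

  // The Char overload (provided by Scala's StringOps) is not a regex,
  // so the pipe is treated literally without any escaping.
  val byChar: List[String] = line.split('|').toList

  def main(args: Array[String]): Unit = {
    println(unescaped) // many single-character pieces
    println(escaped)   // two fields: label and value
    println(byChar)    // same two fields
  }
}
```

Either of the last two forms should yield the two fields the question expects. Which one to use is mostly a matter of taste.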
0

Looks like you're almost there - you just need another groupBy and some mapping to get the desired structure. Altogether this can be done as follows:

import org.apache.spark.rdd.RDD

// counting occurrences and reformatting into Tuple3's:
val countByTuple: RDD[(String, String, Int)] = inputData.map(_.split('|').toList)
  .map(s => (s, 1))
  .reduceByKey(_ + _)
  .map { case (List(label, number), count) => (label, number, count) }

// grouping by text label only, and reformatting into desired structure 
val result: RDD[(String, Iterable[(String, Int)])] = countByTuple.groupBy(_._1)
  .map { case (key, iter) => (key, iter.map(t => (t._2, t._3))) }

result.foreach(println)
// prints:
// (aaaa,List((1000,3), (2000,2), (3000,1)))
// (bbb,List((2000,2), (1000,1)))
// (ccc,List((1000,3), (4000,1), (3000,1), (2000,1)))
Tzach Zohar
  • Hi, thanks for your answer. I ran your code sample on my computer and got "scala.MatchError: (List(ccc<, , , >2000),1) (of class scala.Tuple2)" . I am now adjusting the code. – 조상진 Aug 26 '16 at 02:12
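
Neither answer orders the values by count, which the asker requested in the comments. The following is a rough sketch of the whole count, group, sort, and format flow using plain Scala collections instead of an RDD, so it runs without Spark. The names CountAndSortDemo and format are invented for this example. Two assumptions are made here and are not in the question: ties in count are broken by the value in ascending string order, and rows that do not split into exactly two fields are skipped (which also avoids a MatchError like the one reported above).

```scala
object CountAndSortDemo {
  // Sample rows copied from the question.
  val rows: List[String] = List(
    "aaaa|1000", "bbb|1000", "ccc|1000", "aaaa|1000", "aaaa|2000",
    "aaaa|3000", "aaaa|2000", "aaaa|1000", "bbb|2000", "bbb|2000",
    "ccc|1000", "ccc|1000", "ccc|2000", "ccc|3000", "ccc|4000")

  def format(lines: Seq[String]): List[String] = {
    // Parse each row into (label, value); rows without exactly two
    // fields are dropped (assumption, see the lead-in).
    val pairs = lines.map(_.split('|')).collect {
      case Array(label, value) => (label, value)
    }

    // Count occurrences of each (label, value) pair.
    val counts = pairs.groupBy(identity).map { case (k, v) => (k, v.size) }

    counts.toList
      .groupBy { case ((label, _), _) => label } // regroup by label only
      .toList
      .sortBy { case (label, _) => label }       // labels ascending
      .map { case (label, entries) =>
        val sorted = entries
          .map { case ((_, value), n) => (value, n) }
          // Count descending; ties by value ascending (assumption).
          .sortBy { case (value, n) => (-n, value) }
        val fields = label +: sorted.flatMap { case (v, n) => List(v, n.toString) }
        fields.mkString("||")
      }
  }

  def main(args: Array[String]): Unit = format(rows).foreach(println)
}
```

With the question's data this should give the aaaa and bbb lines exactly as requested. The ccc line has the same counts, but the tied values come out as 2000, 3000, 4000 because of the tie-break chosen here. In the RDD version, the same inner sort could probably be applied inside the map step after groupBy, for example by converting the iterable to a List and calling sortBy on it.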