I have these rows:
(key1,Illinois|111|67342|...)
(key1,Illinois|121|67142|...)
(key2,Hawaii|113|67343|...)
(key1,Illinois|211|67442|...)
(key3,Hawaii|153|66343|...)
(key3,Ohio|193|68343|...)
(1) How do I get the unique keys?
(2) How do I get the number of rows per key? (key1: 3 rows, key2: 1 row, key3: 2 rows, so the output would be 3,1,2)
(3) How do I get the byte size of the rows per key? (e.g. 5MB,2MB,3MB)
EDIT 1. This is my new code:
val rdd : RDD[(String, Array[String])] = ...
// Group all rows per key, then build (key, row count, total byte size)
val rdd_res = rdd.groupByKey().map(row => (row._1, row._2.size, byteSize(row._2)))
// Split the result into one RDD per column
val rddKeys = rdd_res.map(row => row._1)
val rddCount = rdd_res.map(row => row._2)
val rddByteSize = rdd_res.map(row => row._3)
How do I implement byteSize? I want the size that each key's rows will take up when saved to disk.
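To make the question concrete, here is a minimal sketch of what byteSize could look like. It assumes the rows end up on disk as uncompressed UTF-8 text, one line per row, with the fields joined by "|" and a single "\n" at the end of each line. The names RowSize and rowBytes are hypothetical, and a compressed or binary format (for example Parquet) would give different numbers.

```scala
import java.nio.charset.StandardCharsets

object RowSize {
  // Assumption: one row = fields joined by "|" + one trailing newline, encoded as UTF-8.
  def rowBytes(fields: Array[String]): Long =
    fields.mkString("|").getBytes(StandardCharsets.UTF_8).length.toLong + 1L

  // Total size of all rows that share a key (the Iterable produced by groupByKey).
  def byteSize(rows: Iterable[Array[String]]): Long =
    rows.iterator.map(rowBytes).sum
}
```

Under those assumptions, RowSize.byteSize(row._2) could replace byteSize(row._2) in the code above. The result is a Long, so totals above 2 GB do not overflow.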
EDIT 2.
val rdd_res : RDD[(String, (Int, Int))] = rdd.aggregateByKey((0, 0))(
  // within a partition: count one more row and add its size
  (accum, value) => (accum._1 + 1, accum._2 + size(value)),
  // across partitions: add up the partial counts and sizes
  (first, second) => (first._1 + second._1, first._2 + second._2)
)
val rdd_res_keys = rdd_res.map(row => row._1).collect().mkString(",")
val rdd_res_count = rdd_res.map(row => row._2).collect().map(_._1).mkString(",")
val rdd_res_bytes = rdd_res.map(row => row._2).collect().map(_._2).mkString(",")
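Each collect() starts its own Spark job, so a possible variant is to sort by key and collect once, then build all three strings from the same local array. The sketch below runs in local mode on shortened sample rows. The object name PerKeyStats, the size helper (uncompressed UTF-8 text, fields joined by "|", one newline per row), and the three-field rows are assumptions for illustration, not part of my real job.

```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerKeyStats {
  // Assumption: a row is stored as its fields joined by "|" plus a newline, in UTF-8.
  def size(fields: Array[String]): Long =
    fields.mkString("|").getBytes(StandardCharsets.UTF_8).length.toLong + 1L

  // Returns (key, (row count, total bytes)) sorted by key, collected in a single job.
  def perKeyStats(rdd: RDD[(String, Array[String])]): Array[(String, (Long, Long))] =
    rdd.aggregateByKey((0L, 0L))(
      (acc, value) => (acc._1 + 1L, acc._2 + size(value)), // merge one row into the accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2)                 // merge two partial accumulators
    ).sortByKey().collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("per-key-stats").setMaster("local[*]"))
    try {
      // Shortened sample rows: only the first three fields of each row.
      val rdd = sc.parallelize(Seq(
        ("key1", Array("Illinois", "111", "67342")),
        ("key1", Array("Illinois", "121", "67142")),
        ("key2", Array("Hawaii", "113", "67343")),
        ("key1", Array("Illinois", "211", "67442")),
        ("key3", Array("Hawaii", "153", "66343")),
        ("key3", Array("Ohio", "193", "68343"))
      ))
      val stats = perKeyStats(rdd)
      println(stats.map(_._1).mkString(","))    // key1,key2,key3
      println(stats.map(_._2._1).mkString(",")) // 3,1,2
      println(stats.map(_._2._2).mkString(",")) // 57,17,32
    } finally {
      sc.stop()
    }
  }
}
```

Sorting by key keeps the three output strings aligned position by position, and Long accumulators avoid Int overflow once a key holds more than 2 GB.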