
This is my initial output, collected from my RDD:

scala> results
scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926, 
rating -> 1, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)

I remove the string key "rating" so that only numeric keys remain:

scala> val resultsInt = results.filterKeys(_ != "rating")
resultsInt: scala.collection.Map[String,Long] = Map(4.5 -> 1534824, 0.5 -> 239125, 3.0 -> 4291193, 3.5 -> 2200156, 2.0 -> 1430997, 1.5 -> 279252, 4.0 -> 5561926, 1.0 -> 680732, 2.5 -> 883398, 5.0 -> 2898660)

Sorting this sequence by key gives the expected output, but I would like to convert the keys from String to Int before sorting so that the ordering is consistent.

scala> val sortedOut2 = resultsInt.toSeq.sortBy(_._1)
sortedOut2: Seq[(String, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997), (2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))

I am new to Scala and have just started writing my first Spark program. Could someone show me how to convert the keys of this Map?


5 Answers


Based on your sample output, I suppose you meant converting the keys to Double?

val results: scala.collection.Map[String, Long] = Map(
  "4.5" -> 1534824, "0.5" -> 239125, "3.0" -> 4291193, "3.5" -> 2200156,
  "2.0" -> 1430997, "1.5" -> 279252, "4.0" -> 5561926, "rating" -> 1,
  "1.0" -> 680732, "2.5" -> 883398, "5.0" -> 2898660
)

results.filterKeys(_ != "rating").
  map{ case(k, v) => (k.toDouble, v) }.
  toSeq.sortBy(_._1)

res1: Seq[(Double, Long)] = ArrayBuffer((0.5,239125), (1.0,680732), (1.5,279252), (2.0,1430997),
   (2.5,883398), (3.0,4291193), (3.5,2200156), (4.0,5561926), (4.5,1534824), (5.0,2898660))
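
If you'd rather keep the result as a map with deterministic key order instead of a sorted Seq, a SortedMap is another option (a sketch under the same assumptions as above):

import scala.collection.immutable.SortedMap

// Build an immutable SortedMap keyed by Double; iteration then follows numeric key order
val sortedByKey: SortedMap[Double, Long] = SortedMap(
  results.filterKeys(_ != "rating").map { case (k, v) => (k.toDouble, v) }.toSeq: _*
)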
Leo C

To map between different types, you just need to use the map operator in Spark/Scala.

You can check the syntax here: Convert a Map[String, String] to Map[String, Int] in Scala

The same method can be used with Spark and Scala.
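
As a minimal sketch of that idea applied to the map from the question (the results value below is a hypothetical stand-in with only a few of the original entries):

// Hypothetical stand-in for the Map collected in the question
val results: Map[String, Long] = Map("0.5" -> 239125L, "rating" -> 1L, "5.0" -> 2898660L)

// Drop the non-numeric key, then convert the remaining keys with map
val numericKeys: Map[Double, Long] =
  results.filterKeys(_ != "rating").map { case (k, v) => (k.toDouble, v) }.toMap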

Ahmed Kamal

Please see Scala - Convert keys from a Map to lower case?

The approach should be similar:

// Simple case class describing each record in the RDD
case class row(id: String, value: String)

val rddData = sc.parallelize(Seq(row("1", "hello world"), row("2", "hello there")))

// Convert the String id to Int while keeping the value as-is
rddData.map { currentRow => (currentRow.id.toInt, currentRow.value) }
// scala> org.apache.spark.rdd.RDD[(Int, String)]

Even if you didn't define a case class for the structure of the RDD and used something like Tuple2 instead, you can just write

currentRow._1.toInt // instead of currentRow.id.toInt

Please look into String-to-Int conversion; there are a few ways to go about it (see the sketch below).
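
As a rough illustration of those options (not part of the original answer), here are two common ways to turn a String into an Int, with and without handling malformed input:

import scala.util.Try

val s = "42"

// Throws NumberFormatException if the string is not a valid integer
val n1: Int = s.toInt

// Wraps the conversion in Try, so a bad string yields None instead of an exception
val n2: Option[Int] = Try(s.toInt).toOption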

Hope this helps! Good luck :)

cdncat

Distilling your RDD into a Map is legal, but it defeats the purpose of using Spark in the first place. If you are operating at scale, your current approach renders the RDD meaningless. If you aren't, then you can just do Scala collection manipulation as you suggest, but then why bother with the overhead of Spark at all?

I would instead operate at the DataFrame level of abstraction and transform that String column into a Double like this:

import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._

dataFrame
  .select("key", "value")
  .withColumn("key", 'key.cast(DoubleType))

And this of course assumes that Spark didn't already recognize the key as a Double after setting inferSchema to true during the initial data ingest.
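
For context, here is a rough sketch of doing the whole aggregation at the DataFrame level rather than collecting a Map; the file path, column name, and read options are assumptions for illustration, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder().appName("ratings").getOrCreate()
import spark.implicits._

// Hypothetical input: a CSV with a header row and a "rating" column
val ratings = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("ratings.csv")

// Count occurrences per rating and sort by the numeric rating value
val counts = ratings
  .withColumn("rating", 'rating.cast(DoubleType))
  .groupBy("rating")
  .agg(count("*").as("cnt"))
  .orderBy("rating")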

Vidya

If you are trying to filter out the non-numeric keys, you can just do the following:

import scala.util.{Try,Success,Failure}

(results map { case (k,v) => Try (k.toFloat) match {
  case Success(x) => Some((x,v))
  case Failure(_) => None
}}).flatten

res1: Iterable[(Float, Long)] = List((4.5,1534824), (0.5,239125), (3.0,4291193), (3.5,2200156), (2.0,1430997), (1.5,279252), (4.0,5561926), (1.0,680732), (2.5,883398), (5.0,2898660))
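
For what it's worth, the same idea can be written a bit more compactly with flatMap (a sketch, not from the original answer):

// Try(...).toOption yields Some((Float, Long)) for numeric keys and None otherwise;
// flatMap drops the Nones automatically
val numericOnly = results.flatMap { case (k, v) => Try(k.toFloat).toOption.map(f => (f, v)) }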
ryan