
I am new to Spark and Scala. I am trying to query a table in Hive (select two columns from it) and convert the resulting DataFrame into a Map. I am using Spark 1.6 with Scala 2.10.6.

Ex:

Dataframe:
+--------+-------+
| address| exists|
+--------+-------+
|address1|   1   |
|address2|   0   |
|address3|   1   |
+--------+-------+ 
should be converted to: `Map("address1" -> 1, "address2" -> 0, "address3" -> 1)`

This is the code I am using:

val testMap: scala.collection.mutable.Map[String, Any] = scala.collection.mutable.Map()
val df = hiveContext.sql("select address, exists from testTable")
df.foreach { r =>
  val key = r(0).toString
  val value = r(1)
  testMap += (key -> value)
}
testMap.foreach(println)

When I run the above code, I get this error:

java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

It is throwing this error at the line where I am trying to add the key-value pair to the Map, i.e. `testMap += (key -> value)`.

I know that there is a better and simpler way of doing this using `org.apache.spark.sql.functions.map`. However, I am using Spark 1.6 and I don't think this function is available there; I tried doing the import and didn't find it in the list of available functions.

Why is my approach giving me an error? And is there a better/more elegant way of achieving this with Spark 1.6?

Any help would be appreciated. Thank you!

UPDATE:

I changed the way the elements are being added to the Map to the following: `testMap.put(key, value)`.

I was previously using `+=` to add the elements. Now I don't get the `java.lang.NoSuchMethodError` anymore. However, no elements are getting added to testMap: after the foreach step is complete, I print the size of the map and all the elements in it, and there are zero elements.

Why are the elements not getting added? I am also open to any other, better approach. Thank you!!

Hemanth
  • oh, if that's what you need, `org.apache.spark.sql.functions.map` is irrelevant anyway. You just need to convert to `RDD[(String, Int)]` and use `collectAsMap()`. You can find posts about how to convert a DataFrame into RDD. – Tzach Zohar Oct 19 '17 at 20:42
  • That seems like an easy way! But, the result will be an immutable Map right? How do I change it to a Mutable one? – Hemanth Oct 19 '17 at 20:50
  • https://stackoverflow.com/questions/5042878/how-can-i-convert-immutable-map-to-mutable-map-in-scala – Tzach Zohar Oct 19 '17 at 20:51
  • You're awesome! Please post your comments as an answer and I'll accept it. Thanks a lot! :) – Hemanth Oct 19 '17 at 20:59

2 Answers


This can be broken down into 3 steps, each one already solved on SO (a sketch combining them follows the list):

  1. Convert DataFrame to RDD[(String, Int)]
  2. Call collectAsMap() on that RDD to get an immutable map
  3. Convert that map into a mutable one (e.g. as described here)
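
A minimal sketch combining the three steps, assuming `df` is the DataFrame from the question and that `address` is a String and `exists` an Int:

import scala.collection.mutable

// Step 1: DataFrame -> RDD[(String, Int)]
// (assumes `address` is a String and `exists` an Int in the Hive table)
val pairs = df.rdd.map(r => (r.getString(0), r.getInt(1)))

// Step 2: collect to the driver as a Map (read-only view)
val collected: scala.collection.Map[String, Int] = pairs.collectAsMap()

// Step 3: copy into a mutable map, if mutability is really required
val testMap: mutable.Map[String, Int] = mutable.Map(collected.toSeq: _*)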

NOTE: I don't know why you need a mutable map; it's worth noting that using a mutable collection rarely makes much sense in Scala. Sticking with immutable objects only is safer and easier to reason about. "Forgetting" about the existence of mutable collections makes learning functional APIs (like Spark's!) much easier.

Tzach Zohar
  • I was wondering if you can help me with this question: https://stackoverflow.com/questions/47445328/adding-a-new-column-to-a-dataframe-by-using-the-values-of-multiple-other-columns Any help is appreciated, thank you! – Hemanth Nov 22 '17 at 23:47

You can simply collect the data from the DataFrame and iterate over it on the driver; that will work:

df.collect().foreach { r =>
  val key = r(0).toString
  val value = r(1)
  testMap += (key -> value)
}
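
This works because collect() brings all the rows to the driver first, so the `+=` mutates the driver-side testMap. In the original code, foreach ran inside executor tasks, each of which mutated its own serialized copy of the map, which is why testMap stayed empty. Note that collecting is only viable for tables small enough to fit in driver memory.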