
I ran into a puzzling problem. When I ran the code below locally from an IDE (IntelliJ), the result was true.

    import org.apache.spark.{SparkConf, SparkContext}

    val sparkConf = new SparkConf().setAppName("QueryMySql").setMaster("local")
    val sc = new SparkContext(sparkConf)

    // value equality and hash code are based on the name field only
    case class Store(name: String) {
      override def equals(o: Any) = o match {
        case that: Store => that.name.equals(this.name)
        case _ => false
      }
      override def hashCode = name.hashCode
    }

    val storeAddressList = List(
      (Store("Candy"), "dongilStreet 1"),
      (Store("Choco"), "kangnam Street 2"),
      (Store("Choco"), "bongchen Street 3"),
      (Store("Icecream"), "samsung street 4")
    )
    val storeAddress = sc.parallelize(storeAddressList)

    val storeRatingList = List(
      (Store("Candy"), 4.9),
      (Store("Choco"), 4.8)
    )
    val storeRating = sc.parallelize(storeRatingList)

    storeAddress.collect
    storeRating.collect
    println(storeAddress.first._1.equals(storeRating.first._1))

However, when I ran the same code in spark-shell, with the import and the SparkConf/SparkContext lines removed (spark-shell already provides sc), it was:

    case class Store(name: String) {
      override def equals(o: Any) = o match {
        case that: Store => that.name.equals(this.name)
        case _ => false
      }
      override def hashCode = name.hashCode
    }

    val storeAddressList = List(
      (Store("Candy"), "dongilStreet 1"),
      (Store("Choco"), "kangnam Street 2"),
      (Store("Choco"), "bongchen Street 3"),
      (Store("Icecream"), "samsung street 4")
    )
    val storeAddress = sc.parallelize(storeAddressList)

    val storeRatingList = List(
      (Store("Candy"), 4.9),
      (Store("Choco"), 4.8)
    )
    val storeRating = sc.parallelize(storeRatingList)

    storeAddress.collect
    storeRating.collect
    println(storeAddress.first._1.equals(storeRating.first._1))

and the result was false.

To find the cause, I tried the following. First, I checked storeRating.first._1, because storeAddress has 4 partitions with a value in each, whereas storeRating has only two values, so I suspected some of its partitions were empty and that the comparison was effectively

    Store(Candy).equals(null)

but I was wrong; both sides hold a value.
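For completeness, here is roughly how such a check can be done; this is only a sketch using RDD.glom, not output from my original session:

    // Sketch: print each partition's contents; an empty partition prints
    // nothing after the colon. glom() gathers each partition into an Array,
    // so the output shows exactly how the elements are distributed.
    storeRating.glom().collect().zipWithIndex.foreach { case (part, i) =>
      println(s"partition $i: ${part.mkString(", ")}")
    }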

Second, I suspected the hashCode of the Store case class, but, unfortunately for that theory, both objects have the same hash code:

    scala> println(storeAddress.first._1.hashCode())
    64874565

    scala> println(storeRating.first._1.hashCode())
    64874565

Third, I compared the values inside storeAddress and storeRating, and they are identical as well.
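Since my equals relies on a pattern match (case that: Store), one further diagnostic occurred to me; this is only a sketch of a check, not output from my session, comparing the runtime classes of the two objects:

    // Sketch: if the shell ended up loading two different Store classes
    // (e.g. via different classloaders during deserialization), the pattern
    // match inside equals fails even though name and hashCode agree.
    val a = storeAddress.first._1
    val b = storeRating.first._1
    println(a.getClass)
    println(b.getClass)
    println(a.getClass == b.getClass) // false would explain equals returning false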

Please help me find the cause of this bizarre behavior.

  • I think so, but when I executed println(storeAddress.first._1.equals(storeRating.first._1)), it was effectively println(Store(Candy).equals(Store(Candy))). – jongseok Mar 17 '16 at 01:35
  • I agree about viewing RDDs as a whole, but note that they always show me the same result. That is why I posted this question. – jongseok Mar 17 '16 at 01:45
  • Please give me a detailed explanation. I don't understand how the same code can show different results in the IDE and in spark-shell just because of a few partitions and nodes. – jongseok Mar 17 '16 at 01:53
  • These are only comments; I'd prefer that someone with more experience answers this question. However, in my limited experience it's because of the number of partitions and how Spark *splits* the data among the nodes. In the first case you have only one node; in the other case you have more (at least I bet you do). You have to take a look at `spark-defaults`, and this is the main source of this behavior. – Alberto Bonsanto Mar 17 '16 at 01:58
  • I tested both of them locally. Thank you. – jongseok Mar 17 '16 at 02:08
  • Running something locally doesn't mean that there isn't any level of *parallelism*. Take a look at [Spark Configuration](https://dl.dropboxusercontent.com/u/6280514/ShareX/2016/03/firefox_2016-03-16_21-39-56.png). – Alberto Bonsanto Mar 17 '16 at 02:10
  • I know most of your recommendations and think they are important. However, I only want to know the cause of this behavior, even if nothing ran in parallel. – jongseok Mar 17 '16 at 02:15
  • @jongseok Possible duplicate of http://stackoverflow.com/questions/35301998/pattern-matching-case-class-spark-scala . Also, refer to https://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CBB305648A4EF854ABE74310E31D6D2A2021672B648@WMUCV491.wwg00m.rootdom.net%3E – Akash Mar 17 '16 at 08:57
