
Just getting started with Spark and Scala. We've installed Spark 2 on our dev Cloudera Hadoop cluster, and I'm using spark2-shell. I'm going through a book to learn some basics. Its examples show println(foo) working without doing a collect, but that's not working for me:

scala> val numbers = sc.parallelize(10 to 50 by 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> numbers.collect().foreach(println)
10                                                                              
20
30
40
50

scala> numbers.foreach(x => println(x))

scala>

As you can see, nothing prints unless I do a collect().

What's going on? Is the book wrong, or is something funny with my Spark/Scala config?

Version Info:

Spark version 2.0.0.cloudera2
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111)
medloh
    What environment are you running in? If you are running on a cluster, the foreach without a collect first runs on the cluster and not your local machine. – puhlen Mar 29 '17 at 17:18
  • Thanks, guessing that's the issue. Our DEV hadoop cluster running spark has a few data nodes. – medloh Mar 29 '17 at 17:34

1 Answer


That's the expected behaviour: the code passed to numbers.foreach is executed on the worker nodes, and each println writes to that executor's stdout, so the output is never collected and returned to the driver where your shell is running.
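
For example, here's a minimal sketch (reusing the numbers RDD from the question; the comments are mine) contrasting executor-side printing with driver-side alternatives:

// Runs on the executors: each println writes to that executor's stdout,
// so nothing shows up in the driver's spark2-shell on a multi-node cluster.
numbers.foreach(println)

// collect() ships the elements back to the driver, where println runs locally.
numbers.collect().foreach(println)

// On a large RDD, collect() can exhaust driver memory; take(n) is a safer spot-check.
numbers.take(3).foreach(println)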

Harald Gliebe
  • Hmm, ok. The book was assuming you'd be running it on a simple local VM with everything running there. Maybe that's my issue, our DEV hadoop cluster has a few data nodes. – medloh Mar 29 '17 at 17:33