
Just getting started with Spark and Scala. We've installed Spark 2 on our dev Cloudera Hadoop cluster, and I'm using spark2-shell. I'm going through a book to learn some basics. Its examples show println(foo) working without doing a collect, but that's not working for me:

scala> val numbers = sc.parallelize(10 to 50 by 10)
numbers: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24

scala> numbers.collect().foreach(println)
10                                                                              
20
30
40
50

scala> numbers.foreach(x => println(x))

scala>

As you can see, nothing prints unless I do a collect().

What's going on? Is the book wrong, or is something funny with my Spark/Scala config?

Version Info:

Spark version 2.0.0.cloudera2
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111)
medloh
    What environment are you running in? If you are running on a cluster, the foreach without a collect first runs on the cluster and not your local machine. – puhlen Mar 29 '17 at 17:18
  • Thanks, guessing that's the issue. Our DEV hadoop cluster running spark has a few data nodes. – medloh Mar 29 '17 at 17:34

1 Answer


That's the expected behaviour: the code passed to numbers.foreach is executed on the worker nodes, and each println writes to that executor's stdout, so the output is never collected and returned to the driver where your shell is running.
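
For example, here's a minimal sketch (reusing the numbers RDD from the question; the comments are mine) contrasting executor-side printing with driver-side alternatives:

// Runs on the executors: each println writes to that executor's stdout,
// so nothing shows up in the driver's spark2-shell on a multi-node cluster.
numbers.foreach(println)

// collect() ships the elements back to the driver, where println runs locally.
numbers.collect().foreach(println)

// On a large RDD, collect() can exhaust driver memory; take(n) is a safer spot-check.
numbers.take(3).foreach(println)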

Harald Gliebe
  • Hmm, ok. The book was assuming you'd be running it on a simple local VM with everything running there. Maybe that's my issue, our DEV hadoop cluster has a few data nodes. – medloh Mar 29 '17 at 17:33