I am trying to replicate the code given in the documentation to see how foreach works. I tried the following code:

rdd = sc.parallelize([1,2,3,4,5])

def f(a):
    print(a)

rdd.collect().foreach(f)

But it gives the following error:

AttributeError: 'list' object has no attribute 'foreach'

I understand the error: collect() returns an array (a Python list), which has no foreach attribute. What I don't understand is why this doesn't work when it is given in the official Spark 3.0.1 documentation. What am I missing? I am using Spark 3.0.1.

Ankit Agrawal

1 Answer

rdd.collect() is a Spark action that collects the data to the driver. It returns a list, so if you want to print the list's elements, just do:

rdd = sc.parallelize([1,2,3,4,5])

def f(a):
    print(a)

res = rdd.collect()

for e in res:
    f(e)

# output
# 1
# 2
# 3
# 4
# 5
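
The original error can also be reproduced without a cluster. The following is a minimal sketch in plain Python, assuming a hard-coded list as a stand-in for the value returned by rdd.collect(); it only illustrates that the failure comes from the Python list, not from Spark.

```python
# A plain list stands in for the value returned by rdd.collect(),
# so this sketch runs without a Spark installation.
res = [1, 2, 3, 4, 5]

def f(a):
    print(a)

# Python lists have no foreach method, which is what the traceback reports.
print(hasattr(res, "foreach"))  # False

try:
    res.foreach(f)
except AttributeError as err:
    message = str(err)
    print(message)  # 'list' object has no attribute 'foreach'

# On the driver, an ordinary loop does the job.
for e in res:
    f(e)
```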

An alternative is to use a different action, foreach, as mentioned in your example: rdd.foreach(f). However, f then runs on the executors rather than on the driver, so because of the distributed nature of Spark there is no guarantee that print will write to the driver's output stream. In practice, this means you might never see the data printed out.

Why is that? Spark has two deploy modes, cluster and client. In cluster mode, the driver runs on an arbitrary node in the cluster. In client mode, on the other hand, the driver runs on the machine from which you submitted the Spark job, so the Spark driver and your program share the same output stream and you should be able to see anything the driver prints. Output printed on executors running on other machines still goes to those executors' own stdout.
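
If the goal is simply to see the elements in the driver's console in either mode, one option is to bring them to the driver first. The sketch below assumes a hypothetical helper named print_on_driver (not part of PySpark); it relies only on the RDD methods take() and toLocalIterator().

```python
def print_on_driver(rdd, f, limit=None):
    """Apply f to the elements of rdd in the driver process.

    Hypothetical helper for illustration. rdd is expected to provide
    take() and toLocalIterator(), as a PySpark RDD does.
    """
    if limit is not None:
        # take(n) returns a list with at most n elements to the driver.
        elements = rdd.take(limit)
    else:
        # toLocalIterator() returns an iterator over all elements,
        # consumed on the driver.
        elements = rdd.toLocalIterator()

    count = 0
    for e in elements:
        f(e)  # runs on the driver, so print() writes to the driver's stdout
        count += 1
    return count
```

With the rdd and f from the question, print_on_driver(rdd, f) would print the elements in the driver's console, whereas rdd.foreach(f) calls f on the executors. For a large RDD, passing a limit avoids pulling every element to the driver.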

abiratsis
  • I know the solution to print a list, my question was about why `foreach` didn't work as it should according to documentation. If it is working then what am I missing here? – Ankit Agrawal Sep 17 '20 at 14:25
  • In your example you are doing `rdd.collect().foreach(f)`. As mentioned, `rdd.collect()` returns a list. Of course, Python lists do not have a `foreach` method, and this is the reason for the error you posted (`AttributeError: 'list' object has no attribute 'foreach'`). – abiratsis Sep 17 '20 at 14:31
  • I added some more details about printing out the results in different modes. BTW, if you run the tests on Databricks, your jobs will run in cluster mode, so do not expect to see `rdd.foreach(f)` printing anything. – abiratsis Sep 17 '20 at 14:46