
I have seen various postings around (such as the answer to this Stack Exchange post) that give something similar to the code below as a simple example of how to use the foreach() function on a Spark DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
dftest = spark.createDataFrame([(1, "foo"), (2, "bar"), (3, "baz"), (4, "qux")]).toDF("number", "name")

def g(x):
    print(x)

dftest.groupBy("name").count().foreach(g)

Most of the postings I have seen indicate that this should print each row of the grouped counts to the console. For me, however, it does nothing: it runs successfully without throwing any errors, but produces no output at all.

I have tried this on a Cloudera cluster and on a local instance of Spark in a Jupyter notebook with the master set to local[*], both running Spark 2.2.0. The result is the same on both.

What am I doing wrong?

[Spark docs on foreach()](https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.foreach)

Using a for loop over the collect()'ed DataFrame, as one of the answers to this post suggests, does not address the problem I am trying to solve. collect() returns a Python list that you then iterate over in plain Python; that is not the same as iterating over, or applying a Spark function to, every row of the DataFrame on the executors.

Clay
    It does not print to the driver console/log, but to the executors' log. Try the same with `spark.master=local` and you should see the output – Raphael Roth Dec 30 '17 at 18:45
  • Thanks @RaphaelRoth, when running the same code on a local instance of Spark in a jupyter notebook, the same thing happens. – Clay Dec 31 '17 at 14:05
  • @user6910411 : this is not a duplicate of that question. The answer to that question does not solve my problem, it is my problem. – Clay Dec 31 '17 at 14:20
  • The accepted answer explains why writing to `stdout` in `foreach` is not a valid approach. Unless your actual problem is different than "print from workers", then there is no practical difference. – zero323 Dec 31 '17 at 15:28
  • @user6910411, using print in a simple function and then calling foreach is exactly the example in the [Spark docs](https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.foreach). Furthermore, the same problem occurs when using `foreach` on a single node with `master=local`. So yes, the actual problem is different than "print from workers" and there is a practical difference. – Clay Dec 31 '17 at 16:09

0 Answers