I have seen various postings (such as the answer to this Stack Exchange post) that give something like the code below as a simple example of how to use the foreach() function on a Spark DataFrame.
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.getOrCreate()
dftest = spark.createDataFrame([(1, "foo"), (2, "bar"), (3, "baz"), (4, "qux")]).toDF("number", "name")
def g(x): print(x)
dftest.groupBy("name").count().foreach(g)
Most of the postings I have seen indicate that the output should be each row of the grouped counts printed to the console. But this does nothing for me: it runs successfully without throwing any errors, yet nothing is printed.
I have tried this on a Cloudera cluster, and on a local instance of Spark in a Jupyter notebook where the master is local[*]. Both are running Spark 2.2.0, and I get the same result on both.
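For reference, a minimal sketch of how such a local[*] session can be created in a notebook (the application name here is just an illustrative placeholder, not my actual setup):

from pyspark.sql import SparkSession

# Local session using all available cores; "foreach-test" is an arbitrary app name.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("foreach-test") \
    .getOrCreate()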
What am I doing wrong?
Using a for loop over the collect()'ed DataFrame, as one of the answers in this post suggests, does not address the problem I am trying to solve. collect() returns a Python list that you can then iterate over in plain Python; that is not the same thing as iterating over, or applying a Spark function to, every row of the DataFrame.
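To be explicit, this is the kind of driver-side loop I mean (a minimal sketch reusing dftest from above); it prints the Row objects in the driver process, which is not the same as a distributed foreach():

# Pulls all grouped counts back to the driver as a list of Row objects,
# then iterates over them with ordinary Python.
for row in dftest.groupBy("name").count().collect():
    print(row)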