PySpark: UDF is not executing on a dataframe

Question

I am using PySpark in Jupyter on Azure. I am trying to test a UDF on a dataframe; however, the UDF is not executing.

My dataframe is created by:

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

I have confirmed this dataframe is populated with 100 rows. In the next cell I try to execute a simple udf.

def iterateMeals(user):
    print(user)

users.foreach(iterateMeals)

This produces no output. I would have expected each entry in the dataframe to be printed. However, calling iterateMeals('test') directly does run and prints 'test'. I also tried wrapping the function with udf from pyspark.sql.functions:

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def iterateMeals(user):
    print(user)
f_iterateMeals = udf(iterateMeals, LongType())

users.foreach(f_iterateMeals)

When I try this, I receive the following error:

Py4JError: An error occurred while calling o461.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Can someone explain where I have gone wrong? I will need to execute UDFs inside the .foreach of dataframes for this application.

score 2 · Accepted Answer · edited May 23 '17 at 10:28

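You won't see any output because print is executed on the worker nodes, so it goes to each worker's own standard output rather than to the driver (your notebook). See "Why does foreach not bring anything to the driver program?" for a complete explanation.
foreach operates on an RDD, not a DataFrame: DataFrame.foreach(f) is shorthand for df.rdd.foreach(f), so it expects a plain Python function that takes a Row. UDFs are not valid in this context; they are column expressions meant to be used inside DataFrame operations such as select or withColumn.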
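To illustrate both points, here is a minimal sketch, assuming users is the dataframe from the question with a userid column. The names label_user, print_users_on_driver and add_label_column are hypothetical and only for illustration. The first function brings the rows to the driver before printing, so the output shows up in the notebook; the second applies a UDF the way it is meant to be used, as a column expression.

```python
def label_user(userid):
    """Plain Python function; usable on the driver or wrapped in a UDF."""
    return "user {}".format(userid)


def print_users_on_driver(users):
    # collect() ships every row to the driver, so print() output appears
    # in the notebook. Fine for ~100 rows; avoid it on large dataframes.
    for row in users.collect():
        print(label_user(row["userid"]))


def add_label_column(users):
    # Imported here so the helpers above can be used without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # A UDF is applied to columns inside a DataFrame operation,
    # not passed to foreach.
    label_udf = udf(label_user, StringType())
    return users.withColumn("label", label_udf(users["userid"]))
```

For example, print_users_on_driver(users) would print one line per user in the notebook, and add_label_column(users).show() would display the new column. If the work has to stay on the workers, users.foreach(f) with a plain function still runs; its print output just ends up in the executor logs.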

answered Mar 24 '16 at 14:42

zero323
