I am using PySpark in a Jupyter notebook on Azure. I am trying to test a UDF on a DataFrame; however, the UDF is not executing.
My dataframe is created by:
users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")
I have confirmed that this DataFrame is populated with 100 rows. In the next cell I try to execute a simple UDF:
def iterateMeals(user):
    print user

users.foreach(iterateMeals)
This produces no output. I would have expected each entry in the DataFrame to be printed. However, if I call iterateMeals('test') directly, it fires and prints 'test'. I also tried using pyspark.sql.functions:
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def iterateMeals(user):
    print user

f_iterateMeals = udf(iterateMeals, LongType())
users.foreach(f_iterateMeals)
When I try this, I receive the following error:
Py4JError: An error occurred while calling o461.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist
Can someone explain where I have gone wrong? I will need to execute UDFs inside the .foreach of DataFrames for this application.