1

I am using PySpark in Jupyter on Azure. I am trying to test using a UDF on a dataframe however, the UDF is not executing.

My dataframe is created by:

users = sqlContext.sql("SELECT DISTINCT userid FROM FoodDiaryData")

I have confirmed this dataframe is populated with 100 rows. In the next cell I try to execute a simple udf.

def iterateMeals(user):
    print user

users.foreach(iterateMeals)

This produces no output. I would have expected each entry in the dataframe to have been printed. However, if I simply try iterateMeals('test') it will fire and print 'test'. I also tried using pyspark.sql.functions

from pyspark.sql.functions import udf

def iterateMeals(user):
    print user
f_iterateMeals = udf(iterateMeals,LongType())

users.foreach(f_iterateMeals)

When I try this, I receive the following error:

Py4JError: An error occurred while calling o461.__getnewargs__. Trace:
py4j.Py4JException: Method __getnewargs__([]) does not exist

Can someone explain where I have went wrong? I will be needing to execute udfs inside the .foreach of dataframes for this application.

zero323
  • 322,348
  • 103
  • 959
  • 935
Stevenyc091
  • 195
  • 1
  • 2
  • 22

1 Answers1

2
  1. You won't see an output because print is executed on worker nodes and goes to the respective output. See Why does foreach not bring anything to the driver program? for a complete explanation.

  2. foreach operates on a RDD not a DataFrame. UDFs are not valid in this context.

Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935