Suppose you have a file, let's call it `udfs.py`, and in it:

```python
def nested_f(x):
    return x + 1


def main_f(x):
    return nested_f(x) + 1
```

You then want to make a UDF out of the `main_f` function and run it on a dataframe:
```python
import pyspark.sql.functions as fn
import pandas as pd

pdf = pd.DataFrame([[1], [2], [3]], columns=['x'])
df = spark.createDataFrame(pdf)  # `spark` is an existing SparkSession

_udf = fn.udf(main_f, 'int')
df.withColumn('x1', _udf(df['x'])).show()
```
This works OK if we do this from within the same file as the one where the two functions are defined (`udfs.py`). However, trying to do the same from a different file (say `main.py`) produces the error `ModuleNotFoundError: No module named ...`:

```python
...
import udfs

_udf = fn.udf(udfs.main_f, 'int')
df.withColumn('x1', _udf(df['x'])).show()
```
I noticed that if I actually nest `nested_f` inside `main_f` like this:

```python
def main_f(x):
    def nested_f(x):
        return x + 1

    return nested_f(x) + 1
```
everything runs OK. However, my goal here is to have the logic nicely separated into multiple functions, which I can also test individually.
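To illustrate what I mean by testing individually, this is roughly the kind of plain unit test I'd like to keep writing against `udfs.py` (just a sketch; the test file name is hypothetical, and no Spark is involved at all):

```python
# test_udfs.py -- hypothetical test file, runs without any Spark context
from udfs import nested_f, main_f


def test_nested_f():
    assert nested_f(1) == 2


def test_main_f():
    assert main_f(1) == 3  # nested_f(1) + 1
```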
I think this can be solved by submitting the `udfs.py` file (or a whole zipped folder) to the executors using `spark.sparkContext.addPyFile('...udfs.py')` (a rough sketch of this workaround follows the list below). However:
- I find this a bit long-winded (esp. if you need to zip folders etc...)
- This is not always easy/possible (e.g. `udfs.py` may be using lots of other modules which then also need to be submitted, leading to a bit of a chain reaction...)
- There are some other inconveniences with `addPyFile` (e.g. autoreload can stop working etc.)
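For reference, the workaround I have in mind looks roughly like this (just a sketch: the path is a placeholder for my actual layout, and `df` is the dataframe from above):

```python
import pyspark.sql.functions as fn

# Ship the module to the executors before the UDF is used
# (for a whole package you would zip the folder and pass the .zip instead)
spark.sparkContext.addPyFile('/path/to/udfs.py')  # hypothetical path

import udfs  # udfs is now importable on the executors as well

_udf = fn.udf(udfs.main_f, 'int')
df.withColumn('x1', _udf(df['x'])).show()  # no ModuleNotFoundError this time
```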
So the question is: is there a way to do all of these at the same time:

- have the logic of the UDF nicely split into several Python functions
- use the UDF from a different file than where the logic is defined
- not need to submit any dependencies using `addPyFile`
Bonus points for clarifying how this works/why this doesn't work!