I have the following toy example. I am installing pandas and pyarrow onto my worker nodes with a bootstrap script. When I run the following code in a Jupyter notebook, it runs with no errors.
import pandas as pd
import pyspark.sql.functions as f

# Declare the function and create the UDF
def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

@f.pandas_udf("float")
def udf_multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    df = pd.DataFrame({'a': a, 'b': b})
    df['product'] = df.apply(lambda x: multiply_func(x['a'], x['b']), axis=1)
    return df['product']

x = pd.Series([1, 2, 3])
# print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64

# Create a Spark DataFrame; 'spark' is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute the function as a Spark vectorized UDF
df.select(udf_multiply(f.col("x"), f.col("x"))).show()
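For context, the row-wise product that udf_multiply computes can be reproduced in plain pandas without any Spark session. This is only a sketch of the same logic with the Spark wrapper removed, to show what the UDF does per batch:

```python
import pandas as pd

def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    # Element-wise product of two equal-length Series
    return a * b

x = pd.Series([1, 2, 3])

# The same row-wise apply used inside udf_multiply, minus the Spark wrapper
df = pd.DataFrame({'a': x, 'b': x})
df['product'] = df.apply(lambda row: multiply_func(row['a'], row['b']), axis=1)
print(df['product'].tolist())  # → [1, 4, 9]
```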
However, I have many pandas_udfs I'd like to import into my workspace, and I don't want to copy and paste each one at the top of my Jupyter notebook. My desired directory structure looks like this:
eda.ipynb
helpful_pandas_udfs/toy_example.py
I've looked at other SO posts and determined that I should be able to add the Python file like so:
spark.sparkContext.addPyFile("helpful_pandas_udfs/toy_example.py")
from toy_example import udf_multiply
However, when I try to run this code, I get the following error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Please help! I'm completely stumped on this.