I have the following toy example. I am installing pandas and pyarrow onto my worker nodes with a bootstrap script. When I run the following code in a Jupyter notebook, it runs with no errors.
import pandas as pd
import pyspark.sql.functions as f

# Declare the function and create the UDF
def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

@f.pandas_udf("float")
def udf_multiply(a: pd.Series, b: pd.Series) -> pd.Series:
    df = pd.DataFrame({'a': a, 'b': b})
    df['product'] = df.apply(lambda x: multiply_func(x['a'], x['b']), axis=1)
    return df['product']

x = pd.Series([1, 2, 3])
# print(multiply_func(x, x))
# 0    1
# 1    4
# 2    9
# dtype: int64

# Create a Spark DataFrame; 'spark' is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute the function as a Spark vectorized UDF
df.select(udf_multiply(f.col("x"), f.col("x"))).show()
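For context, the row-wise product that udf_multiply computes can be reproduced in plain pandas without any Spark session. This is only a sketch of the same logic with the Spark wrapper removed, to show what the UDF does per batch:

```python
import pandas as pd

def multiply_func(a: pd.Series, b: pd.Series) -> pd.Series:
    # Element-wise product of two equal-length Series
    return a * b

x = pd.Series([1, 2, 3])

# The same row-wise apply used inside udf_multiply, minus the Spark wrapper
df = pd.DataFrame({'a': x, 'b': x})
df['product'] = df.apply(lambda row: multiply_func(row['a'], row['b']), axis=1)
print(df['product'].tolist())  # → [1, 4, 9]
```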
However, I have many pandas_udfs I'd like to import into my workspace, and I don't want to copy and paste each one at the top of my Jupyter notebook. My desired directory structure looks like this:
eda.ipynb
helpful_pandas_udfs/toy_example.py
I've looked at other SO posts and determined that I should be able to add the Python file like so:
spark.sparkContext.addPyFile("helpful_pandas_udfs/toy_example.py")
from toy_example import udf_multiply
However, when I try to run this code, I get the following error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Please help! I'm completely stumped on this.