What is the difference between importing a module in a udf or outside the udf in pyspark?

Question

I am trying to calculate the Euclidean distance between two columns, each of which holds a list of floats. I tried this with a pandas_udf in two ways: one with the imports outside the function and one with the imports inside it. First Approach -

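# Imports at module level (outside the UDF)
import pandas as pd
from scipy.spatial import distance
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]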

Second Approach -

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    from scipy.spatial import distance
    import pandas as pd
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]

Both of them worked in my local Spark setup. What is the difference between the two approaches?
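For reference, the plain-Python side of the two approaches can be sketched without Spark. This is only an illustration under stated assumptions: the standard-library `math` module stands in for `scipy`/`pandas`, and the function names `distance_outer` and `distance_inner` are hypothetical. It shows where the imported name lives in each case: as a module-level global that the function looks up when it runs, or as a local that is bound by an import statement executed on every call (after the first call this is a cheap lookup in `sys.modules`).

```python
import math  # module-level import: `math` is a global of this module


def distance_outer(a, b):
    # `math` is resolved from the module's globals each time the function runs.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_inner(a, b):
    # The import statement runs on every call and binds `math` as a local name.
    # After the first import the module is served from the sys.modules cache.
    import math
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    print(distance_outer([0.0, 0.0], [3.0, 4.0]))
    print(distance_inner([0.0, 0.0], [3.0, 4.0]))
    # Where the name `math` is stored for each function:
    print("global in outer:", "math" in distance_outer.__code__.co_names)
    print("local in inner:", "math" in distance_inner.__code__.co_varnames)
```

As far as I understand, Spark serializes the UDF and runs it in Python worker processes, and in both approaches the library still has to be installed wherever those workers run. The practical difference is likely to be when and where the import is resolved (when the function is loaded versus when it is called), which would explain why both versions behave the same in a local setup.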
