I am trying to calculate the Euclidean distance between two columns, each of which holds a list of floats. I tried this with a pandas_udf in two ways: one with the imports outside the function and one with the imports inside it. First Approach -

import pandas as pd
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.spatial import distance

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]

Second Approach -

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    from scipy.spatial import distance
    import pandas as pd
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]

Both worked in my local Spark setup. What is the difference between the two approaches?
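To make the distinction concrete, here is a minimal plain-Python sketch of the two import styles, without Spark. It assumes Python 3.8+ and uses the standard library's `math.dist` as a stand-in for `scipy.spatial.distance.euclidean`; the function names are made up for illustration. It only shows where the module name is bound in each case, not how Spark ships the function to its workers.

```python
import math  # module-level import: bound once in this module's globals


def distance_with_global_import(a, b):
    # "math" is looked up in the module globals when the function runs.
    return math.dist(a, b)


def distance_with_local_import(a, b):
    # The import statement runs on every call; after the first call it is
    # only a lookup in the sys.modules cache. "math" is a local name here.
    import math
    return math.dist(a, b)


point_a = [0.0, 0.0]
point_b = [3.0, 4.0]

global_result = distance_with_global_import(point_a, point_b)
local_result = distance_with_local_import(point_a, point_b)

# Names the compiler treats as locals of each function.
global_locals = distance_with_global_import.__code__.co_varnames
local_locals = distance_with_local_import.__code__.co_varnames

print(global_result, local_result)
print("math" in global_locals, "math" in local_locals)
```

Both functions return the same value; the only difference visible here is whether the module name lives in the function's locals or in the enclosing module's globals. Whether that matters on a real cluster presumably depends on whether the library is installed on the workers, which this sketch does not check.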

ashish14