I am trying to calculate the Euclidean distance between two columns, each of which holds a list of floats. I tried this with a pandas_udf in two ways: one with the imports outside the function and one with the imports inside it.
First Approach -
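# Imports at module level (T is assumed to be pyspark.sql.types)
import pandas as pd
from pyspark.sql import types as T
from pyspark.sql.functions import pandas_udf, PandasUDFType
from scipy.spatial import distance

@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]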
Second Approach -
@pandas_udf(T.FloatType(), PandasUDFType.SCALAR)
def calculate_euclidean_distance(feature1, feature2):
    # Imports inside the function body
    from scipy.spatial import distance
    import pandas as pd
    features_df = pd.DataFrame({"feature1": feature1, "feature2": feature2})
    features_df["euclidean_distance"] = features_df.apply(lambda x: distance.euclidean(x["feature1"], x["feature2"]), axis=1)
    return features_df["euclidean_distance"]
Both worked in my local Spark setup. What is the difference between the two approaches?
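To make the expected result concrete, here is a minimal Spark-free sketch of the per-row computation that both UDF bodies are meant to perform. It uses only the standard library (`math.dist`, Python 3.8+) in place of `scipy.spatial.distance.euclidean`, and `row_distances` is a hypothetical helper name used for illustration, not part of my actual code. It assumes the two lists in each row have the same length.

```python
import math


def row_distances(feature1, feature2):
    """Return the Euclidean distance for each pair of rows.

    feature1 and feature2 are sequences of equal length, where each
    element is a list of floats (one list per row).
    """
    # zip pairs up row i of feature1 with row i of feature2,
    # mirroring the row-wise apply in the UDF body.
    return [math.dist(a, b) for a, b in zip(feature1, feature2)]


if __name__ == "__main__":
    col1 = [[3.0, 4.0], [1.0, 1.0, 1.0]]
    col2 = [[0.0, 0.0], [1.0, 1.0, 1.0]]
    print(row_distances(col1, col2))
```

Each output element corresponds to one input row, which is what a scalar pandas_udf is expected to return.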