I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses DataFrames (and raw SQL where possible). I transform the features of each instance into a vector so that I can apply a Scaler and end up with a uniform schema regardless of how many features my dataset happens to have.
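For context, the assembly-and-scaling step looks roughly like this (a simplified sketch; the column names, and the assumption that the label is the last column, are illustrative):

```python
from pyspark.ml.feature import VectorAssembler, StandardScaler

# Assemble however many feature columns the dataset happens to have
# into a single vector column, then fit and apply a scaler.
assembler = VectorAssembler(inputCols=df.columns[:-1], outputCol="features")
assembled = assembler.transform(df)

scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
scaled = scaler.fit(assembled).transform(assembled)
```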
As far as I understand, Spark SQL can't do calculations with vector columns, so in order to calculate the distance between instances I've had to define a Python function and register it as a UDF. But I see warnings against using UDFs because Catalyst, the DataFrame query optimiser, treats them as a black box and can't optimise them.
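Concretely, the UDF is along these lines (simplified; the exact metric and names are placeholders, and `spark` is the active SparkSession):

```python
import math
from pyspark.sql.types import DoubleType

# Plain Python UDF: Euclidean distance between two ml Vectors.
def euclidean(v1, v2):
    return math.sqrt(v1.squared_distance(v2))

# Registered so it can also be called from raw SQL, e.g.
# spark.sql("SELECT euclidean(a.features, b.features) AS dist FROM ...")
spark.udf.register("euclidean", euclidean, DoubleType())
```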
My questions are:
- Is it correct that there is no way to calculate the distance between two feature vectors within Spark SQL (i.e. without a UDF)? The sketch after this list shows the kind of expression I mean.
- Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
- Is there some other consideration I've missed?
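To illustrate the first question: if the features were plain `array<double>` columns rather than ml Vectors, it seems the distance could be expressed entirely in SQL with the built-in higher-order functions (Spark 2.4+). This is a hypothetical sketch with made-up table and column names, and as far as I can tell it does not accept Vector columns:

```python
# Hypothetical: works for array<double> columns, not (apparently) for Vectors.
dist = spark.sql("""
    SELECT a.id, b.id,
           sqrt(aggregate(
                  zip_with(a.features, b.features, (x, y) -> (x - y) * (x - y)),
                  0D,
                  (acc, v) -> acc + v)) AS dist
    FROM instances a CROSS JOIN instances b
""")
```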
To be clear, I'm hoping the answer is either
- "You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
- "UDFs are not intrinsically inefficient, this is a perfectly good use for them and there's no opimisation you're missing out on"