
I have implemented a classification algorithm in Spark that involves calculating distances between instances. The implementation uses dataframes (and raw SQL where possible). I transform the features of the instances into a vector so that I can apply a Scaler and end up with a uniform schema regardless of how many features my dataset happens to have.

As far as I understand, Spark SQL can't do calculations with vector columns. So in order to calculate the distance between instances, I've had to define a Python function and register it as a UDF. But I see warnings against using UDFs because the dataframe engine "can't optimise UDFs".
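For concreteness, the setup described above might look roughly like the following. This is only a minimal sketch, not the actual implementation from the question: the Euclidean metric, the function names, and the UDF name `euclidean_distance` are all assumptions made for illustration.

```python
import math


def euclidean_distance(u, v):
    """Plain-Python Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def register_distance_udf(spark, name="euclidean_distance"):
    """Register the distance as a SQL UDF on a SparkSession (hypothetical helper)."""
    # Imported lazily so the pure-Python function above works without Spark.
    from pyspark.sql.types import DoubleType

    def _distance(u, v):
        # ML vectors expose toArray(), which covers dense and sparse vectors.
        return float(euclidean_distance(u.toArray(), v.toArray()))

    spark.udf.register(name, _distance, DoubleType())
```

After calling `register_distance_udf(spark)`, the function could be used from raw SQL as `euclidean_distance(a.features, b.features)`, where `features` is an assumed column name.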

My questions are:

  • Is it correct that there is no way to calculate the distance between two feature vectors within SQL (not using a UDF)?
  • Can the use of a UDF to calculate the distance between vectors have a large impact on performance, or is there nothing for Spark to optimise here anyway?
  • Is there some other consideration I've missed?

To be clear, I'm hoping the answer is either

  • "You're doing it wrong, this is indeed inefficient, here's how to do it instead: ...", or
  • "UDFs are not intrinsically inefficient; this is a perfectly good use for them and there's no optimisation you're missing out on."
oulenz
    Sometimes UDFs are unavoidable but without a specific example, any answer would likely be speculative. [How to create good reproducible spark examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Feb 26 '19 at 15:07

1 Answer


UDFs are neither efficient nor optimized, and they are not translated into JVM code. This is especially costly in PySpark: objects are pickled, and a lot of resources are spent moving data in and out of the JVM. I implemented a geolocation job in PySpark using a UDF and it had not finished after a few days, whereas the same job implemented in Scala finished in a few hours. If you have to do it, do it in Scala. Maybe this can help: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
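One possible way to avoid the Python round trip without switching languages is to convert the vectors to arrays and use Spark SQL's built-in higher-order functions. The sketch below assumes Spark 3.0 or later (for `pyspark.ml.functions.vector_to_array`) and a Euclidean metric; the helper names and the temporary column names are hypothetical, and whether this is actually faster for a given workload has not been measured here.

```python
def euclidean_distance_sql(left, right):
    """Build a Spark SQL expression for the distance between two array<double> columns."""
    # zip_with pairs the elements up; aggregate sums the squared differences.
    squared_diffs = f"zip_with({left}, {right}, (x, y) -> (x - y) * (x - y))"
    total = f"aggregate({squared_diffs}, cast(0 as double), (acc, d) -> acc + d)"
    return f"sqrt({total})"


def add_distance_column(df, left_vec, right_vec, out="distance"):
    """Add a distance column computed from two ML vector columns (hypothetical helper)."""
    # Imported lazily so the expression builder above works without Spark.
    from pyspark.ml.functions import vector_to_array  # assumes Spark 3.0+
    from pyspark.sql import functions as F

    arrays = (
        df.withColumn("_left_arr", vector_to_array(F.col(left_vec)))
        .withColumn("_right_arr", vector_to_array(F.col(right_vec)))
    )
    expr = euclidean_distance_sql("_left_arr", "_right_arr")
    return arrays.withColumn(out, F.expr(expr)).drop("_left_arr", "_right_arr")
```

Because the expression is plain SQL, the string returned by `euclidean_distance_sql` can also be pasted into a raw `spark.sql(...)` query once the vector columns have been converted to arrays.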

Dmitry