
My project involves many operations between NumPy arrays and NumPy matrices that are currently performed inside UDFs. Do you think we would see a performance increase if we used PySpark's internal structures instead (matrix --> DataFrame, NumPy array --> dense vector)? Thank you!

CHIRAQA

1 Answer


UDFs are generally slower than the built-in pyspark.sql.functions operating on the DataFrame API; you should avoid them as much as possible because of the serialization/deserialization overhead of moving data between the JVM and the Python worker.

Spark functions vs UDF performance?

Samir Vyas
  • Yes, but I mean: is there any advantage to using, for example, dense vectors instead of NumPy arrays INSIDE the UDF? Because for me it is impossible to avoid using UDFs – CHIRAQA Sep 09 '20 at 15:20