My project includes many operations between NumPy arrays and NumPy matrices that are currently performed inside UDFs. Do you think that if we used PySpark's internal structures instead (matrix --> DataFrame, NumPy array --> dense vector), we would see an increase in performance? Thank you!
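For context, here is a minimal sketch of the conversions I mean (the column name `features` and the toy data are made up):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# NumPy array -> DenseVector
arr = np.array([1.0, 2.0, 3.0])
dv = Vectors.dense(arr)

# NumPy matrix -> DataFrame, one DenseVector per row
mat = np.eye(3)
df = spark.createDataFrame(
    [(Vectors.dense(row),) for row in mat], ["features"]
)
df.show(truncate=False)
```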
Have you considered using `pandas_udf`? – Steven Sep 09 '20 at 15:19
1 Answer
UDFs are generally slower than the built-in pyspark.sql.functions that operate on the DataFrame API, because every row has to be serialized from the JVM to a Python worker and the result deserialized back. You should avoid UDFs as much as possible because of this serialization/deserialization overhead.
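To make the overhead concrete, here is a minimal sketch (assuming Spark 3.x with pyarrow installed; the column name `x` and the toy data are made up) comparing a plain Python UDF, the vectorized `pandas_udf` suggested in the comment above, and a built-in column expression:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i),) for i in range(1000)], ["x"])

# Plain Python UDF: each value is pickled across the JVM/Python
# boundary one row at a time -- this is where the overhead comes from.
@udf(DoubleType())
def square_udf(x):
    return x * x

# pandas_udf: values arrive in Arrow batches as pandas Series, so the
# work is vectorized and serialization cost drops substantially.
@pandas_udf(DoubleType())
def square_pandas(x: pd.Series) -> pd.Series:
    return x * x

# Built-in column expression: runs entirely in the JVM, no Python at all.
df.select(
    square_udf("x").alias("udf"),
    square_pandas("x").alias("pandas_udf"),
    (col("x") * col("x")).alias("builtin"),
).show(3)
```

The built-in expression is fastest since Catalyst can optimize it and no data ever leaves the JVM; the `pandas_udf` is usually a good compromise when custom NumPy/pandas logic is unavoidable; the row-at-a-time UDF is the slowest of the three.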

Samir Vyas
Yes, but I mean: is there any advantage to using, for example, dense vectors instead of NumPy arrays INSIDE the UDF? Because for me it is impossible to avoid using UDFs. – CHIRAQA Sep 09 '20 at 15:20