My project includes many operations between NumPy arrays and NumPy matrices that are currently performed inside UDFs. Do you think that if we used PySpark's internal structures instead (matrix --> DataFrame, NumPy array --> dense vector), we would see an increase in performance? Thank you!
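For context, here is a minimal sketch of the conversions I mean (the column name `features` and the toy data are made up):

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# NumPy array -> DenseVector
arr = np.array([1.0, 2.0, 3.0])
dv = Vectors.dense(arr)

# NumPy matrix -> DataFrame, one DenseVector per row
mat = np.eye(3)
df = spark.createDataFrame(
    [(Vectors.dense(row),) for row in mat], ["features"]
)
df.show(truncate=False)
```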
Have you considered using `pandas_udf`? – Steven Sep 09 '20 at 15:19
1 Answer
UDFs are generally slower than the built-in pyspark.sql.functions that operate on the DataFrame API, because every row has to be serialized from the JVM to a Python worker and the result deserialized back. You should avoid UDFs as much as possible because of this serialization/deserialization overhead.
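To make the overhead concrete, here is a minimal sketch (assuming Spark 3.x with pyarrow installed; the column name `x` and the toy data are made up) comparing a plain Python UDF, the vectorized `pandas_udf` suggested in the comment above, and a built-in column expression:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf, pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(float(i),) for i in range(1000)], ["x"])

# Plain Python UDF: each value is pickled across the JVM/Python
# boundary one row at a time -- this is where the overhead comes from.
@udf(DoubleType())
def square_udf(x):
    return x * x

# pandas_udf: values arrive in Arrow batches as pandas Series, so the
# work is vectorized and serialization cost drops substantially.
@pandas_udf(DoubleType())
def square_pandas(x: pd.Series) -> pd.Series:
    return x * x

# Built-in column expression: runs entirely in the JVM, no Python at all.
df.select(
    square_udf("x").alias("udf"),
    square_pandas("x").alias("pandas_udf"),
    (col("x") * col("x")).alias("builtin"),
).show(3)
```

The built-in expression is fastest since Catalyst can optimize it and no data ever leaves the JVM; the `pandas_udf` is usually a good compromise when custom NumPy/pandas logic is unavoidable; the row-at-a-time UDF is the slowest of the three.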

Samir Vyas
Yes, but I mean: is there any advantage to using, for example, dense vectors instead of NumPy arrays INSIDE the UDF? Because for me it is impossible to avoid using UDFs. – CHIRAQA Sep 09 '20 at 15:20