
Computing SVD using PySpark:

from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg.distributed import IndexedRowMatrix

# Convert the ml-style vector column to mllib vectors, then build a
# distributed matrix whose rows are keyed by the ID column.
rdd = MLUtils.convertVectorColumnsFromML(df.select("ID", "TF_IDF")).rdd
index_mat = IndexedRowMatrix(rdd)

print('index_mat rows = {}'.format(index_mat.numRows()))
print('index_mat columns = {}'.format(index_mat.numCols()))

svd = index_mat.computeSVD(k=100, computeU=True)

Output:

index_mat rows = 2000
index_mat columns = 6000

The Spark DataFrame has 100 partitions, and I am running this job with 20 executors.

It takes more than an hour, while similar code using scipy runs in about a minute:

from scipy.sparse.linalg import svds

# tfidf_sparse is the same 2000 x 6000 TF-IDF matrix as a scipy sparse matrix
u, s, vt = svds(tfidf_sparse, k=100)

1 Answer


For small datasets, distributed systems like Spark are at a disadvantage. They start to be useful when the data you want to handle doesn't fit in a single machine's memory.

Here's an incomplete list of other potential reasons why Spark is slower than scipy here:

  1. Network communication time:

    For small datasets that fit in a single machine's memory, single-node tools like pandas, numpy and scipy spend less time moving data around and can focus on the actual computation, whereas the 20 executors you are using in Spark have to spend time moving data through the network. So, for the distributed system, factors like network speed, bandwidth and congestion level also influence performance. (See the driver-local sketch after this list.)

  2. It's easier to install scipy with optimal settings than to install Spark with optimal settings:

    It's easier to install/configure scipy with BLAS (a set of accelerated linear algebra routines) than to set up the same dependencies for Spark. For instance, if you are using scipy through conda (from the Anaconda distribution), it already comes with a well-configured BLAS. Spark, by default, uses a vanilla Java implementation of the linear algebra operations and requires you to configure BLAS yourself, on each executor, to get better performance (check the MLlib dependencies documentation for more information). Chances are your system doesn't have the BLAS dependencies installed. (See the quick check after this list.)

  3. You are using the old RDD-based machine learning library: the mllib API.

    You should use the newer ML API. Several Stack Overflow threads explain why you should move to it. You can check this one to get the general idea: What's the difference between Spark ML and MLLIB packages.

    In general, you should use the APIs from pyspark.ml instead of pyspark.mllib (org.apache.spark.ml instead of org.apache.spark.mllib if you are using Scala). So try to rewrite your code with the ml API and benchmark again. (A sketch of one possible migration follows this list.)
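
On the first point: a 2000 x 6000 TF-IDF matrix easily fits in a single machine's memory, so one pragmatic option is to skip the distributed computation entirely, collect the vectors to the driver and run scipy there. A minimal sketch, assuming TF_IDF is a pyspark.ml sparse vector column (as your convertVectorColumnsFromML call suggests):

from scipy.sparse import vstack, csr_matrix
from scipy.sparse.linalg import svds

# Pull the (small) dataset onto the driver; each row holds an ID and a SparseVector.
rows = df.select("ID", "TF_IDF").collect()

# Stack the vectors into one scipy CSR matrix. Densifying row by row is
# acceptable at this size; a direct CSR construction would avoid even that.
tfidf_sparse = vstack([csr_matrix(row["TF_IDF"].toArray()) for row in rows])

u, s, vt = svds(tfidf_sparse, k=100)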
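On the second point, you can quickly verify which BLAS backend your numpy and scipy are linked against; with Anaconda they typically report MKL or OpenBLAS. A minimal check, using the libraries' standard introspection helpers:

import numpy as np
import scipy

# Print the linear algebra backends the libraries were built against
# (e.g. MKL or OpenBLAS under Anaconda, a plain reference BLAS otherwise).
np.show_config()
scipy.show_config()

On the Spark side, a warning like "WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS" in the executor logs means MLlib fell back to the pure-JVM implementation (the exact message depends on your Spark version).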
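On the third point, note that as far as I know a distributed SVD is not directly exposed in pyspark.ml, so a literal rewrite isn't possible; the closest newer-API tool for dimensionality reduction is pyspark.ml.feature.PCA. A hedged sketch of what that migration could look like, reusing your df and k=100:

from pyspark.ml.feature import PCA

# PCA on the ml-style TF_IDF column; k=100 mirrors the rank passed to computeSVD.
pca = PCA(k=100, inputCol="TF_IDF", outputCol="features_reduced")
model = pca.fit(df)
reduced = model.transform(df).select("ID", "features_reduced")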

Not to mention that Spark waits for resources at the beginning of each execution, which can add to the overall job time depending on your cluster's capacity.

If you need more details, please provide a reproducible example, including the data and more information about the size of your dataset (number of observations and size in GB).

Mohamed Ali JAMAOUI