Computing SVD using PySpark:
from pyspark.mllib.util import MLUtils
from pyspark.mllib.linalg.distributed import IndexedRowMatrix

# Convert the ml-package vectors to mllib vectors, then build a distributed matrix
rdd = MLUtils.convertVectorColumnsFromML(df.select("ID", "TF_IDF")).rdd
index_mat = IndexedRowMatrix(rdd)
print('index_mat rows = {}'.format(index_mat.numRows()))
print('index_mat columns = {}'.format(index_mat.numCols()))
svd = index_mat.computeSVD(k=100, computeU=True)
Output:
index_mat rows = 2000
index_mat columns = 6000
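For reference, this is how the decomposition is consumed afterwards (a sketch; the variable names on the left are mine):

U = svd.U  # IndexedRowMatrix of left singular vectors (since computeU=True)
s = svd.s  # DenseVector holding the k=100 singular values
V = svd.V  # local DenseMatrix of right singular vectors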
The Spark DataFrame has 100 partitions, and I am running this job with 20 executors.
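A minimal sketch of how the partitioning is checked, in case it matters (the submit flags in the comment are illustrative, not my exact command):

print(df.rdd.getNumPartitions())  # -> 100
# Submitted with 20 executors, e.g.:
# spark-submit --num-executors 20 job.py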
The PySpark job takes more than an hour, while similar code using SciPy runs in 1 minute:
from scipy.sparse.linalg import svds

# tfidf_sparse is the same 2000 x 6000 TF-IDF matrix as a scipy.sparse matrix
u, s, vt = svds(tfidf_sparse, k=100)
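For completeness, a minimal sketch of how tfidf_sparse can be assembled on the driver from the same DataFrame, assuming ID is a 0-based row index and the 2000 x 6000 matrix fits in driver memory:

import numpy as np
from scipy.sparse import csr_matrix

# Collect the distributed rows and densify each Spark vector (small enough here)
local_rows = df.select("ID", "TF_IDF").collect()
dense = np.zeros((2000, 6000))
for row in local_rows:
    dense[row["ID"]] = row["TF_IDF"].toArray()
tfidf_sparse = csr_matrix(dense)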