
I have a huge list of names and surnames and I am trying to merge them, for example 'Michael Jordan' with 'Jordan Michael'.

I am using the following procedure in PySpark:

  1. Compute TF-IDF -> compute cosine similarity -> convert to a sparse matrix
  2. Compute the string distance matrix -> convert to a dense matrix
  3. Element-wise multiplication of the TF-IDF sparse matrix and the string distance dense matrix to get the 'final similarity'
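
For reference, here is a minimal sketch of step 1 with the DataFrame API (character bigrams are just one possible tokenization, and all column names here are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, col
from pyspark.ml.feature import NGram, HashingTF, IDF, Normalizer

spark = SparkSession.builder.getOrCreate()

names = spark.createDataFrame(
    [(0, "Michael Jordan"), (1, "Jordan Michael")], ["id", "name"]
)

# Split each name into characters and build character bigrams.
chars = names.withColumn("chars", split(col("name"), ""))
bigrams = NGram(n=2, inputCol="chars", outputCol="ngrams").transform(chars)

# TF-IDF over the bigrams (hashed into a fixed-size feature space).
tf = HashingTF(inputCol="ngrams", outputCol="tf", numFeatures=1 << 16).transform(bigrams)
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

# L2-normalize so the dot product of two rows equals their cosine similarity.
normalized = Normalizer(inputCol="tfidf", outputCol="norm", p=2.0).transform(tfidf)
```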

This works fine for 10,000 names, but I am worried about how long it will take for a million names, since each similarity matrix would be 1,000,000 x 1,000,000. (The matrices are symmetric, so I only keep the upper triangle, but that does not change the high time complexity much.)

I have read that after computing the TF-IDF it is really useful to compute the SVD of the output matrix to reduce the dimensions. I couldn't find an example of computeSVD for PySpark in the documentation. Doesn't it exist?

And how can SVD help in my case to reduce the memory usage and computational time?

Any feedback and ideas are welcome.

Mpizos Dimitris
  • Possible duplicate of [Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?](http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu) – eliasah Feb 13 '16 at 06:41

2 Answers


Just to update this, computeSVD is now available in the PySpark mllib API for both RowMatrix and IndexedRowMatrix:

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix

https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix
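
A quick usage sketch (illustrative values; assumes an active SparkContext sc):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([
    Vectors.dense(1.0, 0.0, 7.0),
    Vectors.dense(2.0, 3.0, 5.0),
    Vectors.dense(4.0, 6.0, 9.0),
])
mat = RowMatrix(rows)

# Keep the top k singular values/vectors; computeU=True also materializes U.
svd = mat.computeSVD(k=2, computeU=True)
U = svd.U  # distributed RowMatrix of left singular vectors
s = svd.s  # DenseVector of singular values
V = svd.V  # local DenseMatrix of right singular vectors
```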

Evan Zamir
  • It seems computeSVD is very slow: it took 26 hours to compute the SVD of a matrix with 20M rows and 100k columns at a sparsity of 0.003. Do you have any benchmarks on large sparse matrices? – eSadr Aug 01 '18 at 00:54

I couldn't find an example of computeSVD for PySpark. Doesn't it exist?

No, it doesn't. As of now (Spark 1.6.0 / Spark 2.0.0 SNAPSHOT), computeSVD is available only in the Scala API. You can use the solution provided by eliasah here:

Pyspark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they are explaining?
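
That answer works around the missing Python binding by calling the Scala computeSVD through the JVM object that backs a RowMatrix. Roughly (adapted from the linked answer; treat the wrapper details as a sketch rather than an official API):

```python
from pyspark.mllib.common import JavaModelWrapper
from pyspark.mllib.linalg.distributed import RowMatrix

class SVD(JavaModelWrapper):
    """Thin wrapper around the Java SingularValueDecomposition object."""
    @property
    def U(self):
        u = self.call("U")
        return RowMatrix(u) if u is not None else None

    @property
    def s(self):
        return self.call("s")

    @property
    def V(self):
        return self.call("V")

def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
    # Delegate to the Scala RowMatrix.computeSVD via the wrapped Java matrix.
    java_model = row_matrix._java_matrix_wrapper.call(
        "computeSVD", int(k), bool(computeU), float(rCond)
    )
    return SVD(java_model)
```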

And how can SVD help in my case to reduce the memory usage and computational time?

It depends. If your data is simply a set of very short (2-3 word) strings and you tokenize it by splitting on whitespace, it won't help you at all. It cannot improve on the brute-force approach you use, and your data is already extremely sparse.

If you process your data in some context or extract more complex features (n-grams, for example), it can reduce the cost, but it still won't help you with the overall complexity.
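
If you do go that route, the payoff from SVD is a low-rank projection, roughly like this (a sketch: tfidf_row_matrix is assumed to be a RowMatrix of your TF-IDF vectors, k is illustrative, and computeSVD is the wrapper above or the built-in method in newer Spark):

```python
k = 100  # target rank (illustrative)
svd = computeSVD(tfidf_row_matrix, k, computeU=False)

# Project A (n x d) onto the top-k right singular vectors V (d x k): each name
# becomes a dense length-k vector, so pairwise comparisons run over k dimensions
# instead of the full sparse vocabulary.
V = svd.V.toArray()  # small local d x k NumPy array
reduced = tfidf_row_matrix.rows.map(lambda v: v.toArray().dot(V))
```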

zero323