I have a huge list of names and surnames and I am trying to merge duplicates, for example 'Michael Jordan' with 'Jordan Michael'.
I am using the following procedure in pyspark:
- Compute TF-IDF -> compute cosine similarity -> convert to a sparse matrix
- Compute a string-distance matrix -> convert to a dense matrix
- Element-wise multiply the TF-IDF sparse matrix and the string-distance dense matrix to get the 'final similarity' (see the sketch below)
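To make the steps concrete, here is a simplified, runnable sketch of the pipeline on toy data. The character-bigram tokenisation, the number of hash features, and the Levenshtein-based similarity are only placeholders for my actual settings, and I keep per-pair rows instead of building the full matrices:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF, IDF, Normalizer

spark = SparkSession.builder.appName("name-matching").getOrCreate()

names = spark.createDataFrame(
    [(0, "Michael Jordan"), (1, "Jordan Michael"), (2, "Larry Bird")],
    ["id", "name"],
)

# TF-IDF over character bigrams of the lowercased name
tokenizer = RegexTokenizer(inputCol="name", outputCol="chars",
                           pattern=".", gaps=False, toLowercase=True)
ngrams = NGram(n=2, inputCol="chars", outputCol="bigrams")
tf = HashingTF(inputCol="bigrams", outputCol="tf", numFeatures=1 << 12)
idf = IDF(inputCol="tf", outputCol="tfidf")
norm = Normalizer(inputCol="tfidf", outputCol="tfidf_norm", p=2.0)

vecs = tokenizer.transform(names)
vecs = ngrams.transform(vecs)
vecs = tf.transform(vecs)
vecs = idf.fit(vecs).transform(vecs)
vecs = norm.transform(vecs)

# Cosine similarity = dot product of the L2-normalised TF-IDF vectors
dot = F.udf(lambda a, b: float(a.dot(b)), DoubleType())

a = vecs.alias("a")
b = vecs.alias("b")

pairs = (
    a.join(b, F.col("a.id") < F.col("b.id"))          # upper triangle only
     .select(
         F.col("a.name").alias("name_a"),
         F.col("b.name").alias("name_b"),
         dot(F.col("a.tfidf_norm"), F.col("b.tfidf_norm")).alias("cosine"),
     )
     # normalised Levenshtein similarity stands in for my string-distance matrix
     .withColumn(
         "lev_sim",
         1 - F.levenshtein("name_a", "name_b")
             / F.greatest(F.length("name_a"), F.length("name_b")),
     )
     .withColumn("final_sim", F.col("cosine") * F.col("lev_sim"))
)

pairs.show()
```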
This works fine for 10,000 names, but I doubt it will scale to a million names, since each matrix would then be 1,000,000 x 1,000,000. (The matrices are symmetric, so I only keep the upper triangle, but that does not change the quadratic time and memory cost very much.)
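To put numbers on that (back-of-envelope, assuming one 8-byte double per pair and ignoring any sparsity):

```python
n = 1_000_000
pairs = n * (n - 1) // 2           # upper triangle only: ~5 * 10^11 pairs
bytes_needed = pairs * 8           # 8 bytes per double
print(pairs, bytes_needed / 1e12)  # ~499999500000 pairs, ~4.0 TB
```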
I have read that, after computing the TF-IDF, it is really useful to compute the SVD of the output matrix to reduce its dimensions. However, I couldn't find an example of computeSVD for pyspark in the documentation. Does it not exist?
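For reference, this is the kind of call I would expect based on the Scala RowMatrix API (the toy vectors below stand in for my real TF-IDF vectors; I am not sure the Python binding exists in my Spark version):

```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Placeholder rows standing in for the real TF-IDF vectors
rows = sc.parallelize([
    Vectors.dense([1.0, 0.0, 2.0, 0.0]),
    Vectors.dense([0.0, 3.0, 0.0, 1.0]),
    Vectors.dense([1.0, 1.0, 1.0, 1.0]),
])

mat = RowMatrix(rows)

# Keep only the top k singular values/vectors (k << TF-IDF dimensionality);
# k=2 here only because the toy vectors are 4-dimensional.
svd = mat.computeSVD(k=2, computeU=True)
U = svd.U   # distributed RowMatrix of left singular vectors
s = svd.s   # DenseVector of the top-k singular values
V = svd.V   # local DenseMatrix of right singular vectors
```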
And how would SVD help in my case to reduce the high memory usage and computation time?
Any feedback and ideas are welcome.