How to find cosine similarity for a very large array

Question

I have a very large dataset of domain names, approximately 1 million entries.

I want to find similar domains that are duplicated in the dataset because of misspellings.

So I have been using cosine similarity to find similar documents.

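from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["example.com", "examplecom", "googl.com", "google.com"]  # ... about 1 million entries in total
tfidf_vectorizer = TfidfVectorizer(analyzer="char")
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset)  # was fit_transform(documents), an undefined name
cs = cosine_similarity(tfidf_matrix, tfidf_matrix)  # dense (n_samples, n_samples) result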

The example above works fine for a small dataset, but for a large dataset it throws an out-of-memory error.

System Configuration:

1) 8 GB RAM

2) 64-bit system with 64-bit Python installed

3) i3-3210 processor

How to find cosine similarity for a large dataset?

Are you using the `sklearn.metrics.pairwise.cosine_similarity` function? It returns a matrix of shape (n_samples, n_samples), so if your dataset has 1 million samples, it tries to return a matrix of 10^12 elements, which is too large. You would need to reduce the size of your input or find some way to divide your problem into smaller subproblems. – Thijs van Ede, Oct 16 '18 at 09:50
@ThijsvanEde, yes, I am using the `sklearn.metrics.pairwise.cosine_similarity` function. – Rakesh Chaudhari, Oct 16 '18 at 09:56
What is your plan for the similarities afterwards? As @ThijsvanEde notes, you'd have an array of literally a trillion elements. How would you use it? – Amadan, Oct 16 '18 at 10:57
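
To put the comments above in numbers: a dense matrix of 10^6 x 10^6 64-bit floats needs about 8 TB, against 8 GB of RAM. One way to divide the problem into smaller subproblems is to compute the similarities one block of rows at a time and keep only the pairs above a threshold. The following is a minimal sketch of that idea, not a tuned implementation. The function name `similar_pairs`, the `threshold` of 0.9 and the `block_size` of 100 are illustrative assumptions, not part of the original post. It relies on `TfidfVectorizer` L2-normalizing rows by default, so a dot product between two rows is their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def similar_pairs(names, threshold=0.9, block_size=100):
    """Yield (i, j, similarity) for every pair i < j with cosine similarity >= threshold.

    Hypothetical helper: names and default values are illustrative assumptions.
    """
    # Rows come out L2-normalized (the TfidfVectorizer default, norm="l2"),
    # so the dot product of two rows equals their cosine similarity.
    matrix = TfidfVectorizer(analyzer="char").fit_transform(names)
    n_samples = matrix.shape[0]
    for start in range(0, n_samples, block_size):
        block = matrix[start:start + block_size]
        # Dense array of shape (rows in block, n_samples): the only large
        # intermediate held in memory at any one time.
        sims = (block @ matrix.T).toarray()
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i = start + int(r)
            j = int(c)
            if i < j:  # skip self-matches and mirrored duplicates
                yield i, j, float(sims[r, c])


if __name__ == "__main__":
    sample = ["example.com", "examplecom", "googl.com", "google.com"]
    for i, j, sim in similar_pairs(sample, threshold=0.9, block_size=2):
        print(sample[i], sample[j], round(sim, 3))
```

With 1 million names and `block_size=100`, each dense block takes roughly 800 MB as 64-bit floats, so the block size has to be chosen to fit the available memory. The total work is still quadratic in the number of names; only the peak memory is bounded.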

Daniel F · Answer 1 · 2018-10-16T11:05:06.820


You can use a KDTree on normalized inputs to get the cosine distance, as per the answer here: for unit-length vectors, the squared Euclidean distance equals 2(1 - cosine similarity), so the nearest neighbours in Euclidean distance are also the most similar in the cosine sense. Then it's just a case of setting a maximum distance you want to return (so you don't keep all the larger distances, which is most of the memory you are using) and returning a sparse distance matrix, for example a coo_matrix, from scipy.spatial.cKDTree.sparse_distance_matrix.

Unfortunately I don't have my interpreter handy to code up a full answer right now, but that's the gist of it.

Make sure whatever model you're fitting from that distance matrix can accept sparse inputs, though.
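
A minimal sketch of the KDTree approach described above, under stated assumptions. The function name `near_duplicates` and the `min_similarity` default are hypothetical, and the character-level tf-idf vectors are assumed to be low-dimensional enough to densify (cKDTree needs a dense array, and KD-trees become less effective as the number of dimensions grows). The similarity threshold is converted to a Euclidean radius using ||u - v||^2 = 2(1 - cos(u, v)) for unit vectors.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.feature_extraction.text import TfidfVectorizer


def near_duplicates(names, min_similarity=0.95):
    """Return (i, j, similarity) for pairs i < j with cosine similarity >= min_similarity.

    Hypothetical helper: the name and default threshold are illustrative assumptions.
    """
    # TfidfVectorizer L2-normalizes rows by default; cKDTree needs dense input.
    vectors = TfidfVectorizer(analyzer="char").fit_transform(names).toarray()

    # For unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v)),
    # so a similarity floor becomes a Euclidean distance ceiling.
    max_distance = np.sqrt(2.0 * (1.0 - min_similarity))

    tree = cKDTree(vectors)
    # Only pairs within max_distance are stored, so the result stays sparse.
    sparse = tree.sparse_distance_matrix(tree, max_distance, output_type="coo_matrix")

    # Keep each pair once and drop self-matches.
    keep = sparse.row < sparse.col
    similarity = 1.0 - sparse.data[keep] ** 2 / 2.0
    return [
        (int(i), int(j), float(s))
        for i, j, s in zip(sparse.row[keep], sparse.col[keep], similarity)
    ]


if __name__ == "__main__":
    sample = ["example.com", "examplecom", "googl.com", "google.com"]
    for i, j, sim in near_duplicates(sample, min_similarity=0.9):
        print(sample[i], sample[j], round(sim, 3))
```

How sparse the result is depends entirely on the threshold: if `min_similarity` is set too low, the number of stored pairs grows towards the full n x n case again.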

edited Oct 16 '18 at 11:05

answered Oct 16 '18 at 10:59


Could you please give more detail on how to use the KDTree? – Rakesh Chaudhari Oct 17 '18 at 04:57
