
I have a very large dataset of domain names, approximately 1 million entries.

I want to find similar domains that appear as duplicates in the dataset because of misspellings.

So I have been using cosine similarity to find similar documents.

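from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["example.com", "examplecom", "googl.com", "google.com"]  # ... about 1 million entries
tfidf_vectorizer = TfidfVectorizer(analyzer="char")  # character-level TF-IDF features
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset)  # originally passed the undefined name `documents`
cs = cosine_similarity(tfidf_matrix, tfidf_matrix)  # n x n similarity matrix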

The example above works fine for a small dataset, but for a large dataset it throws an out-of-memory error.

System configuration:

1) 8 GB RAM

2) 64-bit system with 64-bit Python installed

3) i3-3210 processor

How can I compute cosine similarity for a large dataset?

Rakesh Chaudhari
  • Are you using the `sklearn.metrics.pairwise.cosine_similarity` function? It returns a matrix of shape (n_samples, n_samples), so if your dataset has 1 million samples, it tries to return a matrix of 10^12 entries, which is too large. You would need to reduce the size of your input or find some way to divide your problem into smaller subproblems. – Thijs van Ede Oct 16 '18 at 09:50
  • @ThijsvanEde, yes, I am using the sklearn.metrics.pairwise.cosine_similarity function. – Rakesh Chaudhari Oct 16 '18 at 09:56
  • What is your plan for the similarities afterwards? As @ThijsvanEde notes, you'd have an array of literally a trillion elements. How would you use it? – Amadan Oct 16 '18 at 10:57
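
One way to read "divide your problem into smaller subproblems" is to compute the similarities one block of rows at a time and keep only the pairs above a cutoff. The sketch below is an illustration under assumptions, not something proposed in the comments: the function name `similar_pairs` and the parameters `threshold` and `chunk_size` are hypothetical. It relies on `TfidfVectorizer` L2-normalizing rows by default, so a dot product between rows is their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def similar_pairs(names, threshold=0.9, chunk_size=100):
    """Yield (i, j, similarity) for i < j with cosine similarity >= threshold.

    `threshold` and `chunk_size` are illustrative values, not tuned ones.
    """
    # Rows are L2-normalized by default (norm="l2"), so row . row = cosine.
    matrix = TfidfVectorizer(analyzer="char").fit_transform(names)
    n = matrix.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk_size x n) block is held densely at any one time.
        block = (matrix[start:stop] @ matrix.T).toarray()
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i < c:  # skip self-matches and mirrored duplicates
                yield int(i), int(c), float(block[r, c])


names = ["example.com", "examplecom", "googl.com", "google.com"]
pairs = list(similar_pairs(names, threshold=0.9, chunk_size=2))
```

This bounds memory, since each dense block holds `chunk_size * n` floats (about 800 MB for 100 rows against 1 million names). The running time is still quadratic in the number of names.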

1 Answer


You can build a KD-tree on L2-normalized inputs and use Euclidean distance as a stand-in for cosine distance, as per the answer here. (For unit vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so the two rank pairs identically.) Then it's just a case of setting the maximum distance you want returned (so you don't keep all the larger distances, which account for most of the memory you are using) and getting back a sparse distance matrix, for example a coo_matrix from scipy.spatial.cKDTree.sparse_distance_matrix.

Unfortunately I don't have my interpreter handy to code up a full answer right now, but that's the gist of it.

Make sure whatever model you're fitting from that distance matrix can accept sparse inputs, though.
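
A minimal sketch of the idea above, under stated assumptions: the 0.9 similarity cutoff and the variable names are hypothetical, and the four-name list stands in for the real dataset. It uses the fact that `TfidfVectorizer` L2-normalizes rows by default, so for two rows a and b, ||a − b||² = 2 · (1 − cos(a, b)), and a similarity cutoff can be converted into a distance cutoff.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.feature_extraction.text import TfidfVectorizer

names = ["example.com", "examplecom", "googl.com", "google.com"]

# cKDTree needs a dense array. With analyzer="char" there is only one
# column per distinct character, so the dense matrix stays narrow.
vectors = TfidfVectorizer(analyzer="char").fit_transform(names).toarray()

# Convert a cosine-similarity cutoff into a Euclidean distance cutoff
# (valid because the rows are unit vectors).
min_similarity = 0.9  # hypothetical threshold
max_distance = np.sqrt(2.0 * (1.0 - min_similarity))

tree = cKDTree(vectors)
# Only pairs within max_distance are stored, so the result is sparse.
sparse = tree.sparse_distance_matrix(tree, max_distance, output_type="coo_matrix")

# Keep each pair once (i < j) and map distances back to cosine similarity.
mask = sparse.row < sparse.col
similarities = 1.0 - sparse.data[mask] ** 2 / 2.0
pairs = list(zip(sparse.row[mask].tolist(),
                 sparse.col[mask].tolist(),
                 similarities.tolist()))
```

KD-trees tend to lose their advantage as the number of dimensions grows, so this is more plausible for character-level features than for a large word or n-gram vocabulary.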

Daniel F