I have a very large dataset of domain names, approximately 1 million entries.
I want to find domains that are near-duplicates of each other because of misspellings.
So far I have been using cosine similarity to find similar entries:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dataset = ["example.com", "examplecom", "googl.com", "google.com"]  # ... approx. 1 million domains
tfidf_vectorizer = TfidfVectorizer(analyzer="char")
tfidf_matrix = tfidf_vectorizer.fit_transform(dataset)
cs = cosine_similarity(tfidf_matrix, tfidf_matrix)
The above code works fine for a small dataset, but on the full dataset it throws an out-of-memory error. That is not surprising: the dense similarity matrix for 1,000,000 domains has 1,000,000 × 1,000,000 float64 entries, roughly 8 TB, which cannot fit in 8 GB of RAM.
System Configuration:
1) 8 GB RAM
2) 64-bit system and 64-bit Python installed
3) Intel i3-3210 processor
How can I compute cosine similarity (or otherwise find near-duplicate domains) on a dataset this large without running out of memory?
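One direction I have been considering is to compute the similarity matrix in row chunks and keep only the pairs above a similarity threshold, so the full 1,000,000 × 1,000,000 matrix never has to exist in memory. A rough sketch of that idea (the chunk size and threshold here are arbitrary guesses, not tuned values):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similar_pairs(dataset, threshold=0.9, chunk_size=100):
    # Vectorize once; the TF-IDF matrix is sparse and should fit in memory.
    vectorizer = TfidfVectorizer(analyzer="char")
    tfidf = vectorizer.fit_transform(dataset)
    n = tfidf.shape[0]
    for start in range(0, n, chunk_size):
        chunk = tfidf[start:start + chunk_size]
        # Dense block of shape (chunk_size, n): about chunk_size * n * 8 bytes,
        # i.e. roughly 800 MB per block for n = 1,000,000 and chunk_size = 100.
        sims = cosine_similarity(chunk, tfidf)
        rows, cols = np.where(sims > threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i < c:  # skip self-matches and duplicate (a, b)/(b, a) pairs
                yield dataset[i], dataset[c], sims[r, c]

This keeps memory bounded, but it is still O(n²) comparisons overall, so I suspect it would be very slow on 1 million domains. Is something along these lines the right approach, or is there a standard way to handle this?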