How to crawl semantically similar sentences

Question

I want to create a corpus for a machine learning task. I have a small textual dataset and want to crawl similar sentences from the web. To measure similarity I tried the sentence_transformers package with a pretrained BERT model, doc2vec, and spaCy similarity. I set the threshold to 85%, but the sentences that scored above the threshold weren't really relevant. How can I crawl similar sentences from the web in Python?

Include a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) in your questions, please. Right now your question is not focused enough to be answerable. — Amitai Irron, Jun 06 '20 at 12:23

score 1 · Answer 1 · answered Jun 06 '20 at 15:16

I think you should train a big model on a big corpus and then use that model to generate random sentences. The gensim library provides several corpora that you can use to find similar sentences or to train a model that generates similar sentences.
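As a rough illustration of the "find similar sentences" part, here is a minimal sketch of threshold filtering over crawled candidates. It is not gensim or sentence_transformers code: `embed` is a hypothetical stand-in that builds a plain bag-of-words vector so the example runs with the standard library only, and `filter_similar` and the 0.5 threshold are likewise made up for illustration. In practice you would swap `embed` for whatever model you trained and tune the threshold on a few hand-labelled pairs.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Hypothetical stand-in for a trained model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_similar(seeds, candidates, threshold=0.5):
    """Keep candidates whose best similarity to any seed meets the threshold.

    Returns (sentence, score) pairs, highest score first.
    """
    seed_vecs = [embed(s) for s in seeds]
    kept = []
    for sentence in candidates:
        vec = embed(sentence)
        best = max((cosine(vec, s) for s in seed_vecs), default=0.0)
        if best >= threshold:
            kept.append((sentence, best))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    seeds = ["the cat sat on the mat"]
    crawled = ["the cat sat on a mat", "stock prices fell sharply today"]
    for sentence, score in filter_similar(seeds, crawled):
        print(f"{score:.3f}  {sentence}")
```

Comparing each candidate against every seed and keeping the best score means a crawled sentence only has to resemble one example from your dataset, which is usually what you want when growing a small corpus.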
