1

I want to create a corpus for a machine learning task. I have a small textual dataset and want to crawl similar sentences from the web. I used the sentence_transformers package with a pretrained BERT model, doc2vec, and spaCy similarity to measure similarity. I set the threshold to 85%, but the sentences with a similarity score above the threshold weren't really relevant. How can I crawl similar sentences from the web in Python?

Laure
  • Include a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) in your questions, please. Right now your question is not focused enough to be answerable. – Amitai Irron Jun 06 '20 at 12:23

1 Answer

1

I think you should train a large model on a large corpus and then use that model to generate random sentences. The gensim library provides several corpora that you can use either to find similar sentences or to train a model that generates similar sentences.
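For the "find similar sentences" part, here is a minimal sketch of a threshold filter. It is not the gensim or sentence_transformers API: to keep it dependency-free I swapped in a plain TF-IDF cosine similarity, and the names `filter_candidates`, `seeds`, and `candidates` are my own. It assumes the candidate sentences have already been fetched from the web and split into a list. The threshold of 0.5 is only a placeholder; a lexical score like this is on a different scale than an embedding score, so it would need tuning on your data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (a deliberately crude tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(sentences):
    # Smoothed inverse document frequency over all given sentences.
    n = len(sentences)
    doc_freq = Counter()
    for sentence in sentences:
        doc_freq.update(set(tokenize(sentence)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def tfidf_vector(sentence, idf):
    # Sparse vector as a dict: term -> term count * idf weight.
    counts = Counter(tokenize(sentence))
    return {t: c * idf.get(t, 1.0) for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(seeds, candidates, threshold=0.5):
    # Keep each candidate whose best similarity to any seed sentence
    # reaches the threshold; return (sentence, score) pairs, best first.
    idf = build_idf(list(seeds) + list(candidates))
    seed_vecs = [tfidf_vector(s, idf) for s in seeds]
    kept = []
    for cand in candidates:
        vec = tfidf_vector(cand, idf)
        best = max((cosine(vec, sv) for sv in seed_vecs), default=0.0)
        if best >= threshold:
            kept.append((cand, best))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

You would call it as `filter_candidates(my_small_dataset, crawled_sentences, threshold=0.5)` and inspect the top and bottom of the returned list by hand before settling on a threshold. The same loop works with embedding vectors if you replace `tfidf_vector` and `cosine` with your model's encoder and similarity function.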

DaveR