
As a programming noob, I am trying to find similar sentences in several hundred newspaper articles. My code worked brilliantly on a smaller text sample. Now, running the same code on a larger text file, I get the error "[E1002] Span index out of range.".

This is my code so far:

!pip install spacy
import spacy
nlp = spacy.load('en_core_web_sm')
nlp.max_length = 2000000
with open('/content/BSE.txt', 'r', encoding="utf-8", errors="ignore") as f:
    sentences_articles = f.read()
about_doc = nlp(sentences_articles)
sentences = list(about_doc.sents)

len(sentences)

sentences[:10]

!pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
import torch

embedder = SentenceTransformer('all-mpnet-base-v2')

corpus = sentences
corpus_embeddings = embedder.encode(corpus, show_progress_bar=True, batch_size = 128)

The progress bar stops at 94% with the error "[E1002] Span index out of range". Reading the file with .readlines() instead ran without an error, but because of the nature of my text data the results were unusable. Limiting the number of words in each sentence didn't help either. I also tried several other text files (different length, different content), without success.

Any suggestions on how to fix this?

Mathias

1 Answer


I had a similar problem with the same error. For me it was solved by changing sentences from a list[Span] to a list[str], as this is what .encode() requires. Instead of sentences = list(about_doc.sents), write sentences = list(sent.text for sent in about_doc.sents).
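To illustrate the conversion, here is a minimal sketch. The helper name to_strings and the FakeSpan class are hypothetical and not part of spaCy or sentence-transformers. FakeSpan only stands in for a spaCy Span (which exposes its text through the .text attribute) so that the snippet runs without any model installed. Dropping whitespace-only sentences is an extra assumption of mine, not something the fix requires.

```python
from typing import Iterable, List


def to_strings(sentences: Iterable[object]) -> List[str]:
    """Turn Span-like objects (anything with a .text attribute) into plain strings."""
    texts = []
    for sent in sentences:
        # Plain strings pass through; Span-like objects contribute their .text.
        text = sent if isinstance(sent, str) else sent.text
        text = text.strip()
        if text:  # assumption: skip whitespace-only "sentences"
            texts.append(text)
    return texts


class FakeSpan:
    """Hypothetical stand-in for a spaCy Span, used only to make this sketch runnable."""

    def __init__(self, text: str) -> None:
        self.text = text


corpus = to_strings([FakeSpan("First sentence. "), FakeSpan("\n"), "Second sentence."])
print(corpus)
```

In the question's code this would presumably become corpus = to_strings(about_doc.sents), with corpus then passed to embedder.encode() as before.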

tnitn