I am using spaCy for sentence segmentation on texts that may start with a paragraph number or letter, for example:

text1 = "1. Dies ist ein Text"
text2 = "A. Dies ist ein Text"
text3 = "1.) Dies ist ein Text"
text4 = "B.) Dies ist ein Text"

In all of these texts, the paragraph number may be followed by \r, \n or \t.
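
To make the possible openings concrete, here is a minimal sketch in plain Python (no spaCy involved) of the markers I mean. The names `MARKER` and `split_marker` are made up for illustration, and the pattern assumes the marker is a single letter or digit; it is not meant as a solution, only as a description of the input.

```python
import re

# Hypothetical pattern (not part of spaCy): a single letter or digit,
# a period, an optional ")", then any run of spaces, \t, \r or \n.
MARKER = re.compile(r'^([A-Za-z0-9]\.\)?)[ \t\r\n]*')

def split_marker(text):
    """Return (marker, rest); marker is None if the text has no leading marker."""
    match = MARKER.match(text)
    if match is None:
        return None, text
    return match.group(1), text[match.end():]

samples = [
    "1. Dies ist ein Text",
    "A.\tDies ist ein Text",
    "1.)\r\nDies ist ein Text",
    "B.)\nDies ist ein Text",
]
for sample in samples:
    print(split_marker(sample))
```

Each sample should come back as its marker plus the same remaining sentence, regardless of which whitespace characters follow the marker.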

spaCy's sentence segmentation yields the following as the first sentence of each text:

**** 1.
**** A.
**** 1.)
**** B.)

Therefore, I am trying to add a rule for how sentences should be split by

  1. writing my own function that implements the rule and
  2. passing this function to the nlp pipeline

Unfortunately, I am having trouble defining this rule properly.

I have done the following:

import re

def custom_sentensizer(doc):
    # A single letter or digit, optionally followed by a period: "1", "A", "1.", "B."
    boundary1 = re.compile(r'^[a-zA-Z0-9]\.?$')
    # A closing parenthesis, as in "1.)"
    boundary2 = re.compile(r'\)')
    prev = doc[0].text
    last = len(doc) - 1
    for i, token in enumerate(doc):
        if i != last and (boundary1.match(prev) or (boundary2.match(token.text) and prev == ".")):
            # Do not start a new sentence right after a paragraph marker
            doc[i+1].is_sent_start = False
        prev = token.text
    return doc

and passed this function to nlp

import spacy

nlp = spacy.load('de_core_news_sm')
# The component must run before the parser so that the parser respects the preset boundaries
nlp.add_pipe(custom_sentensizer, before='parser')

all_sentences = []

for text in texts: # texts is a list of lists, each containing one text
    doc = nlp(text)
    sentences = list(doc.sents)
    all_sentences.append(sentences)

For the texts above this seems to work, but only when they contain no \r, \n or \t.

Therefore, my two questions:

  1. How do I deal with \r, \n and \t? They are sometimes valid sentence boundaries, so I don't want to define a rule that excludes them outright.

  2. My own function seems very complicated. Is there an easier way to do this?

Thanks for your help!

FredMaster
  • This answer may help you: https://stackoverflow.com/questions/52205475/sentence-segmentation-using-spacy – ETL Nov 14 '19 at 19:44
  • Possible duplicate of [Sentence Segmentation using Spacy](https://stackoverflow.com/questions/52205475/sentence-segmentation-using-spacy) – ETL Nov 14 '19 at 19:44

0 Answers