I am using spaCy to do sentence segmentation on texts that may start with
text1 = "1. Dies ist ein Text"
text2 = "A. Dies ist ein Text"
text3 = "1.) Dies ist ein Text"
text4 = "B.) Dies ist ein Text"
In all of these texts, the paragraph number may also be followed by \r, \n or \t.
spaCy's sentence segmentation yields the following first sentence for each text:
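For reference, all four prefix shapes (with optional trailing whitespace) can be described by one regular expression. The pattern below is my own sketch, not part of the code in question:

```python
import re

# one alphanumeric character, a dot, an optional ")", then any
# trailing whitespace (space, \r, \n or \t)
PREFIX = re.compile(r'^[A-Za-z0-9]\.\)?\s*')

for t in ["1. Dies ist ein Text", "A. Dies ist ein Text",
          "1.) Dies ist ein Text", "B.)\nDies ist ein Text"]:
    m = PREFIX.match(t)
    print(t[m.end():])  # the text with its paragraph number stripped
```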
**** 1.
**** A.
**** 1.)
**** B.)
Therefore, I am attempting to add a rule for how sentences should be split by
- writing my own function that implements this rule, and
- adding that function to the nlp pipeline.
Unfortunately, I am having trouble defining the rule properly.
I have done the following:
import re

def custom_sentensizer(doc):
    boundary1 = re.compile(r'^[a-zA-Z0-9]\.?$')
    boundary2 = re.compile(r'\)')
    prev = doc[0].text
    length = len(doc)
    for i, token in enumerate(doc):
        if (boundary1.match(prev) and i != (length - 1)) or \
           (boundary2.match(token.text) and prev == "." and i != (length - 1)):
            doc[i + 1].sent_start = False
        prev = token.text
    return doc
and passed this function to nlp:
nlp = spacy.load('de_core_news_sm')
nlp.add_pipe(custom_sentensizer, before='parser')
all_sentences = []
for text in texts:  # texts is a list of lists, each inner list holding one text
    doc = nlp(text)
    sentences = [sent for sent in doc.sents]
    all_sentences.append(sentences)
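Stripped of spaCy, the rule in custom_sentensizer amounts to: suppress a sentence start after a short "X."-style token, or after a ")" whose previous token was ".". A minimal sketch of just that decision logic, using plain strings in place of tokens (the function name is mine, for illustration only):

```python
import re

boundary1 = re.compile(r'^[a-zA-Z0-9]\.?$')
boundary2 = re.compile(r'\)')

def keeps_sentence_open(prev, token, is_last):
    """True if the token AFTER `token` should not start a new sentence."""
    return (bool(boundary1.match(prev)) and not is_last) or \
           (bool(boundary2.match(token)) and prev == "." and not is_last)
```

So after a "1." token the next token stays in the same sentence, as does the token after ")" when a "." came directly before it; any other pair leaves the default behaviour untouched.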
For the texts above this seems to work, but only when they contain no \r, \n or \t.
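A likely reason (my reading, not verified against the spaCy tokenizer): the whitespace becomes its own token, so prev is e.g. "\n" by the time the next word is reached, and boundary1 no longer matches:

```python
import re

boundary1 = re.compile(r'^[a-zA-Z0-9]\.?$')

print(bool(boundary1.match("1.")))  # the check succeeds on "1."
print(bool(boundary1.match("\n")))  # but fails once prev is a whitespace token
```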
Therefore, my two questions:
1. How do I deal with \r, \n and \t? They are sometimes valid boundaries for sentence splitting, so I don't want a rule that excludes them entirely.
2. My own function seems very complicated. Is there an easier way to do this?
Thanks for your help!