Spacy custom sentence segmentation on line break

Question

I'm trying to split this document into paragraphs. Specifically, I would like to split the text wherever there is a line break (<br>).

This is the code I'm using, but it is not producing the results I hoped for:

import spacy

nlp = spacy.load("en_core_web_lg")

def set_custom_boundaries(doc):
    # Mark the token after each "<br>" as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)  # text holds the document to split
print([sent.text for sent in doc.sents])

A similar result could be achieved with NLTK's TextTilingTokenizer, but I wanted to check whether there is anything similar in spaCy.

https://spacy.io/usage/rule-based-matching I remember seeing something like [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}], — Programmer_nltk, May 14 '20 at 00:46
How about simply replacing `<br>` with a dot before calling the spacy pipeline? — dimid, May 14 '20 at 11:08
See also https://stackoverflow.com/questions/52205475/sentence-segmentation-using-spacy — alelom, Jun 21 '21 at 12:59

score 1 · Answer 1 · answered May 14 '20 at 11:40

You're almost there, but the problem is that the default tokenizer splits on '<' and '>', so "<br>" is broken into several tokens and the condition token.text == "<br>" is never true. I'd add a space before and after <br> and register it as a tokenizer special case, e.g.:

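import spacy
from spacy.symbols import ORTH


def set_custom_boundaries(doc):
    # Start a new sentence at the token that follows each "<br>"
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
text = "the quick brown fox<br>jumps over the lazy dog"
# Surround "<br>" with spaces so the tokenizer sees it as a separate chunk
text = text.replace('<br>', ' <br> ')
# Keep "<br>" as a single token instead of splitting it into "<", "br", ">"
special_case = [{ORTH: "<br>"}]
nlp.tokenizer.add_special_case("<br>", special_case)

# Set the boundaries first, before the parser runs
nlp.add_pipe(set_custom_boundaries, first=True)
doc = nlp(text)
print([sent.text for sent in doc.sents])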

Also take a look at this PR. Once it is merged to master, it will no longer be necessary to wrap <br> in spaces:

https://github.com/explosion/spaCy/pull/4259
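
The text.replace('<br>', ' <br> ') step above only handles the literal string <br>. As a minimal sketch, assuming the input may also contain variants such as <br/> or <BR /> (the question does not say so), the padding could be done with a regular expression before the text reaches the tokenizer. The helper name pad_line_breaks is hypothetical and the code uses only the standard library:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case, plus surrounding whitespace.
# Handling these variants is an assumption; the answer only covers literal "<br>".
BR_PATTERN = re.compile(r"\s*<br\s*/?>\s*", flags=re.IGNORECASE)


def pad_line_breaks(text: str) -> str:
    """Rewrite every line-break tag as ' <br> ' so it is whitespace-delimited."""
    return BR_PATTERN.sub(" <br> ", text)


padded = pad_line_breaks("the quick brown fox<br>jumps over<BR />the lazy dog")
print(padded)  # the quick brown fox <br> jumps over <br> the lazy dog
```

The padded string can then be passed to nlp(...) as in the code above, with the special case for "<br>" still registered on the tokenizer.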
