I'm trying to split this document into paragraphs. Specifically, I would like to split the text wherever there is a line break (<br>).

This is the code I'm using, but it is not producing the results I hoped for:

import spacy

nlp = spacy.load("en_core_web_lg")

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i+1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before="parser")
doc = nlp(text)
print([sent.text for sent in doc.sents])

A similar result could be achieved with NLTK's TextTilingTokenizer, but I wanted to check whether spaCy offers anything comparable.

Dan

1 Answer

You're almost there. The problem is that the default tokenizer splits on '<' and '>', so the condition token.text == "<br>" is never true. I'd add a space before and after each <br> and register <br> as a tokenizer special case so that it survives as a single token, e.g.:

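import spacy
from spacy.symbols import ORTH


def set_custom_boundaries(doc):
    # Mark the token after each <br> as the start of a new sentence.
    # doc[:-1] skips the last token, so doc[token.i + 1] always exists.
    for token in doc[:-1]:
        if token.text == "<br>":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
text = "the quick brown fox<br>jumps over the lazy dog"
# Pad <br> with spaces so the tokenizer sees it as a separate chunk.
text = text.replace("<br>", " <br> ")
# Keep <br> as a single token instead of letting it be split on '<' and '>'.
special_case = [{ORTH: "<br>"}]
nlp.tokenizer.add_special_case("<br>", special_case)

# first=True runs the component ahead of the parser.
# This is the spaCy v2 API; in v3, add_pipe takes the string name of a
# component registered with @Language.component.
nlp.add_pipe(set_custom_boundaries, first=True)
doc = nlp(text)
print([sent.text for sent in doc.sents])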

Also take a look at this PR; once it is merged to master, wrapping <br> in spaces will no longer be necessary.

https://github.com/explosion/spaCy/pull/4259
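
If paragraph boundaries are all that is needed, one alternative is to split the raw string on <br> before it reaches spaCy at all. The following is only a minimal sketch using the standard library: split_paragraphs and BR_PATTERN are hypothetical names, not part of spaCy, and tolerating variants such as <br/> and <BR> is an assumption about the input rather than something stated in the question.

```python
import re

# Hypothetical helper, not a spaCy API: matches <br>, <br/>, <br /> and
# upper-case variants, together with any whitespace around the tag.
BR_PATTERN = re.compile(r"\s*<br\s*/?>\s*", flags=re.IGNORECASE)


def split_paragraphs(text):
    """Return the non-empty chunks of `text` separated by <br> tags."""
    return [chunk.strip() for chunk in BR_PATTERN.split(text) if chunk.strip()]


text = "the quick brown fox<br>jumps over the lazy dog"
paragraphs = split_paragraphs(text)
print(paragraphs)  # ['the quick brown fox', 'jumps over the lazy dog']
```

Each paragraph can then be processed on its own, for example with nlp.pipe(paragraphs), so that sentence segmentation inside a paragraph is left to the parser.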

dimid