How can I break a document (e.g., paragraph, book, etc) into sentences.
For example, "The dog ran. The cat jumped"
into ["The dog ran", "The cat jumped"]
with spacy?
For spaCy 2.x, the up-to-date answer is this:
from __future__ import unicode_literals, print_function
from spacy.lang.en import English  # updated

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # updated: add the rule-based sentencizer
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text works in both 2.x and 3.x; sent.string is deprecated
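For the example text above, sentences should then come out as:

['Hello, world.', 'Here are two sentences.']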
Answer
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'My first birthday was great. My 2. was even better.'
sentences = [i for i in nlp(text).sents]
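Note that doc.sents yields Span objects rather than plain strings. If you want strings, take each span's text attribute:

sentences = [sent.text for sent in nlp(text).sents]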
Additional info
This assumes that you have already installed the model "en_core_web_sm" on your system. If not, you can easily install it by running the following command in your terminal:
$ python -m spacy download en_core_web_sm
(See here for an overview of all available models.)
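If you prefer to stay inside Python, the model can also be downloaded programmatically via spacy.cli.download (a minimal sketch, equivalent to the terminal command above):

import spacy

# downloads and installs the model package, then loads it
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")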
Depending on your data this can lead to better results than just using spacy.lang.en.English, since the statistical model predicts sentence boundaries from context while the sentencizer splits on punctuation alone. One (very simple) comparison example:
import spacy
from spacy.lang.en import English

nlp_simple = English()
nlp_simple.add_pipe(nlp_simple.create_pipe('sentencizer'))

nlp_better = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'

for nlp in [nlp_simple, nlp_better]:
    for i in nlp(text).sents:
        print(i)
    print('-' * 20)
Outputs:
>>> My first birthday was great.
>>> My 2.
>>> was even better.
>>> --------------------
>>> My first birthday was great.
>>> My 2. was even better.
>>> --------------------
With spaCy 3.0.1 they changed the pipeline.
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
def split_in_sentences(text):
    doc = nlp(text)
    return [str(sent).strip() for sent in doc.sents]
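Usage:

print(split_in_sentences('The dog ran. The cat jumped.'))
# ['The dog ran.', 'The cat jumped.']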
From spaCy's GitHub support page (note that spacy.en only exists in very old spaCy versions; see the updated snippet at the end of this answer):
from __future__ import unicode_literals, print_function
from spacy.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
For current versions (e.g. 3.x and above), use the code below for optimal results with the statistical model, rather than the rule-based sentencizer component.
Also note that you can speed up processing and reduce the memory footprint if you include only the pipeline components that are needed for sentence separation.
import spacy

# instantiate pipeline with any model of your choosing
nlp = spacy.load("en_core_web_sm")

text = "The dog ran. The cat jumped. The 2. fox hides behind the house."

# only select necessary pipeline components to speed up processing
with nlp.select_pipes(enable=['tok2vec', "parser", "senter"]):
    doc = nlp(text)
    for sentence in doc.sents:
        print(sentence)
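If sentence splitting is all you need from the model, another option is to exclude the unneeded components at load time, so they are never loaded at all. A minimal sketch, assuming en_core_web_sm is installed (the excluded names are the components of the small English pipeline):

import spacy

# exclude everything the parser does not need for sentence boundaries
nlp = spacy.load("en_core_web_sm",
                 exclude=["tagger", "attribute_ruler", "lemmatizer", "ner"])

doc = nlp("The dog ran. The cat jumped.")
print([sent.text for sent in doc.sents])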
Updated to reflect the comments in the first answer
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe('sentencizer')
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]