How can I break a document (e.g., paragraph, book, etc) into sentences.
For example, "The dog ran. The cat jumped"
into ["The dog ran", "The cat jumped"]
with spacy?
For spaCy 2.x, the up-to-date answer is this:
from __future__ import unicode_literals, print_function
from spacy.lang.en import English  # updated

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer'))  # updated: add the rule-based sentencizer
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]  # sent.text works in both 2.x and 3.x; sent.string is deprecated
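For the example text above, sentences should then come out as:

['Hello, world.', 'Here are two sentences.']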
Answer
import spacy
nlp = spacy.load('en_core_web_sm')
text = 'My first birthday was great. My 2. was even better.'
sentences = [i for i in nlp(text).sents]
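Note that doc.sents yields Span objects rather than plain strings. If you want strings, take each span's text attribute:

sentences = [sent.text for sent in nlp(text).sents]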
Additional info
This assumes that you have already installed the model "en_core_web_sm" on your system. If not, you can easily install it by running the following command in your terminal:
$ python -m spacy download en_core_web_sm
(See here for an overview of all available models.)
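If you prefer to stay inside Python, the model can also be downloaded programmatically via spacy.cli.download (a minimal sketch, equivalent to the terminal command above):

import spacy

# downloads and installs the model package, then loads it
spacy.cli.download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")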
Depending on your data this can lead to better results than just using spacy.lang.en.English, since the statistical model predicts sentence boundaries from context while the sentencizer splits on punctuation alone. One (very simple) comparison example:
import spacy
from spacy.lang.en import English

nlp_simple = English()
nlp_simple.add_pipe(nlp_simple.create_pipe('sentencizer'))

nlp_better = spacy.load('en_core_web_sm')

text = 'My first birthday was great. My 2. was even better.'

for nlp in [nlp_simple, nlp_better]:
    for i in nlp(text).sents:
        print(i)
    print('-' * 20)
Outputs:
>>> My first birthday was great.
>>> My 2.
>>> was even better.
>>> --------------------
>>> My first birthday was great.
>>> My 2. was even better.
>>> --------------------
With spaCy 3.0.1 they changed the pipeline.
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')
def split_in_sentences(text):
    doc = nlp(text)
    return [str(sent).strip() for sent in doc.sents]
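Usage:

print(split_in_sentences('The dog ran. The cat jumped.'))
# ['The dog ran.', 'The cat jumped.']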
From spaCy's GitHub support page (note that spacy.en only exists in very old spaCy versions; see the updated snippet at the end of this answer):
from __future__ import unicode_literals, print_function
from spacy.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
For current versions (e.g. 3.x and above), use the code below for optimal results with the statistical model, rather than the rule-based sentencizer component.
Also note that you can speed up processing and reduce the memory footprint if you include only the pipeline components that are needed for sentence separation.
import spacy

# instantiate pipeline with any model of your choosing
nlp = spacy.load("en_core_web_sm")

text = "The dog ran. The cat jumped. The 2. fox hides behind the house."

# only select necessary pipeline components to speed up processing
with nlp.select_pipes(enable=['tok2vec', "parser", "senter"]):
    doc = nlp(text)
    for sentence in doc.sents:
        print(sentence)
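If sentence splitting is all you need from the model, another option is to exclude the unneeded components at load time, so they are never loaded at all. A minimal sketch, assuming en_core_web_sm is installed (the excluded names are the components of the small English pipeline):

import spacy

# exclude everything the parser does not need for sentence boundaries
nlp = spacy.load("en_core_web_sm",
                 exclude=["tagger", "attribute_ruler", "lemmatizer", "ner"])

doc = nlp("The dog ran. The cat jumped.")
print([sent.text for sent in doc.sents])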
Updated to reflect the comments in the first answer
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe('sentencizer')
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]