How can I split a string into a list by sentences, but keep the \n?

Question

I want to split text into sentences but keep the \n. For example, I want to turn:

Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.

Arrived totally in as between private. Favour of so as on pretty though elinor direct.

into sentences like:

['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\n Arrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']

Right now I'm using this code with re to split the sentences:

import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"

def remove_urls(text):
    text = re.sub(r'http\S+', '', text)
    return text

def split_into_sentences(text):
    print("in")
    print(text)
    text = " " + text + "  "
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    if "..." in text: text = text.replace("...",".<prd>")
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    print(sentences)
    return sentences

However, the code removes the \n, which I need. I'm using the text in moviepy, and moviepy has no built-in function for spacing out text with \n, so I have to write my own. The only way I can do that is by keeping \n as a marker in the text, but splitting into sentences also removes the \n. What should I do?

can you please use code markup (not quote) for your text? It's ambiguous if you have one or two `\n`, also I don't think you need to provide all your code, stick to the minimum useful — mozway, Aug 12 '22 at 04:53
This isn't the way to go here. Look into using a grammatical parsing package like NLTK which can correctly identify sentences. There are many edge cases which your current approach probably is not covering. — Tim Biegeleisen, Aug 12 '22 at 04:57

score 1 · Accepted Answer · edited Aug 12 '22 at 05:03

You can use a lookbehind, (?<=...), for the part of the separator you want to keep, followed by the characters you want the split to remove:

import re
s='Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.\n\nArrived totally in as between private. Favour of so as on pretty though elinor direct.'
re.split(r'(?<=\.)[ \n]', s)

output:

['Civility vicinity graceful is it at.',
 'Improve up at to on mention perhaps raising.',
 'Way building not get formerly her peculiar.',
 '\nArrived totally in as between private.',
 'Favour of so as on pretty though elinor direct.']

edited Aug 12 '22 at 05:03 by mozway
answered Aug 12 '22 at 04:57 by Allan Wind

Thank you! How could I apply this to work with ! and ? instead of just . – PyroManieAct Aug 16 '22 at 00:34
Change `\.` to `[!?.]` (you might have to backslash-escape the `.` and `?`). – Allan Wind Aug 16 '22 at 13:15
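
A minimal sketch of the suggestion in the comment above, assuming sentences end in `.`, `!` or `?` and are followed by one space or newline. The function name `split_keep_newline` and the sample string are made up for illustration. Inside a character class, `.` and `?` are literal, so no escaping is needed there.

```python
import re

def split_keep_newline(text):
    # The lookbehind (?<=[.!?]) keeps the punctuation attached to its sentence.
    # Only the single space or newline matched by [ \n] is dropped, so the
    # second '\n' of a blank line stays at the start of the next sentence.
    return re.split(r'(?<=[.!?])[ \n]', text)

# Hypothetical sample text, not from the question.
sample = 'Is it at? Improve up at once!\n\nArrived totally. Favour of so.'
print(split_keep_newline(sample))
```

On this sample the result is `['Is it at?', 'Improve up at once!', '\nArrived totally.', 'Favour of so.']`. Abbreviations such as "Dr." would still be split, so this does not replace the prefix handling in the question's code.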

Ramesh · Answer 2 · 2022-08-12T04:54:07.370

You could split on `.` with `str.split`. Note that this drops the periods themselves:

text = '''Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty though elinor direct.'''

text.split('.')
>>> ['Civility vicinity graceful is it at', ' Improve up at to on mention perhaps raising', ' Way building not get formerly her peculiar', '\nArrived totally in as between private', ' Favour of so as on pretty though elinor direct', '']

See also: Split by comma and strip whitespace in Python
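
Building on the plain `split('.')` approach, here is a rough sketch that puts the periods back and trims only spaces, so a leading `\n` survives. The name `split_on_periods` and the sample text are hypothetical. It assumes every sentence ends with `.`. A bare `strip()` removes all whitespace, including `\n`, which is probably why the `s.strip()` call in the question's code loses the newline.

```python
def split_on_periods(text):
    # str.split('.') drops the periods, so add one back to each piece.
    pieces = text.split('.')
    # strip(' ') trims spaces only; a bare strip() would also remove '\n'.
    # Pieces that are empty or whitespace-only (e.g. after the final '.') are skipped.
    return [p.strip(' ') + '.' for p in pieces if p.strip()]

# Hypothetical sample text, not from the question.
text = 'Way building not get. Her peculiar.\nArrived totally. Favour of so.'
print(split_on_periods(text))
```

This prints `['Way building not get.', 'Her peculiar.', '\nArrived totally.', 'Favour of so.']`. If the text does not end with a period, the last piece still gets one appended, so this only suits input that matches the stated assumption.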

score 0 · Answer 3 · answered Aug 12 '22 at 05:02

I have been able to reproduce your output using this:

txt = 'Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar. \nArrived totally in as between private. Favour of so as on pretty though elinor direct.'

Code:

updated_text = [a if a.endswith('.') else a+'.' for a in txt.split('. ')]

Output:

['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\nArrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']
