
I want to split text into sentences but keep the \n, turning text such as:

Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.

Arrived totally in as between private. Favour of so as on pretty though elinor direct.

into sentences like:

['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\n Arrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']

Right now I'm using this code with re to split the sentences:

import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"

def remove_urls(text):
    text = re.sub(r'http\S+', '', text)
    return text

def split_into_sentences(text):
    print("in")
    print(text)
    text = " " + text + "  "
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    if "..." in text: text = text.replace("...",".<prd>")
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    print(sentences)
    return sentences

However, the code strips out the \n, which I need. I'm rendering the text with moviepy, which has no built-in function for spacing out text with \n, so I have to write my own. The only way I can do that is to keep \n in the text as a marker, but splitting the sentences removes it. What should I do?

  • Can you please use code markup (not quote) for your text? It's ambiguous whether you have one or two `\n`. Also, I don't think you need to provide all your code; stick to the minimum useful. – mozway Aug 12 '22 at 04:53
  • This isn't the way to go here. Look into using a grammatical parsing package like NLTK which can correctly identify sentences. There are many edge cases which your current approach probably is not covering. – Tim Biegeleisen Aug 12 '22 at 04:57

3 Answers


You can use a lookbehind, (?<=...), to keep the separator (the period), followed by the characters you want the split to remove:

import re
s = 'Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.\n\nArrived totally in as between private. Favour of so as on pretty though elinor direct.'
re.split(r'(?<=\.)[ \n]', s)

output:

['Civility vicinity graceful is it at.',
 'Improve up at to on mention perhaps raising.',
 'Way building not get formerly her peculiar.',
 '\nArrived totally in as between private.',
 'Favour of so as on pretty though elinor direct.']
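
A possible extension of the same lookbehind idea, offered as a sketch rather than as part of the answer above: widening the character class to `[.!?]` also splits after question and exclamation marks. The function name `split_keep_newlines` and the sample text are illustrative assumptions. Like the original pattern, it would still split after abbreviations such as "Dr.", so it only suits simple text.

```python
import re

def split_keep_newlines(text):
    # Hypothetical helper: split on ONE space or newline that directly
    # follows '.', '!' or '?'. The lookbehind keeps the punctuation, and
    # any second '\n' stays at the start of the next sentence.
    return re.split(r'(?<=[.!?])[ \n]', text)

sample = "Is it at? Improve up!\n\nArrived totally. Favour of so."
sentences = split_keep_newlines(sample)
print(sentences)
```

Sentences that begin a new paragraph keep a leading `\n`, which can then serve as the line-break marker.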
mozway
Allan Wind

You could split on "." (note that this drops the periods themselves):

text = '''Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty though elinor direct.'''

text.split('.')
>>> ['Civility vicinity graceful is it at', ' Improve up at to on mention perhaps raising', ' Way building not get formerly her peculiar', '\nArrived totally in as between private', ' Favour of so as on pretty though elinor direct', '']

See also: Split by comma and strip whitespace in Python
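
On the whitespace-stripping point: in the question's `split_into_sentences`, the `\n` appears to survive the split itself and is only lost in the final `s.strip()`, because `str.strip()` with no argument removes all leading and trailing whitespace, newlines included. The following is a minimal sketch of stripping only spaces instead. It uses a shortened stand-in for the question's pipeline, and the sample text and variable names are illustrative assumptions.

```python
# Shortened stand-in for the question's approach: mark sentence ends,
# then split on the marker (the sample text is an assumption).
text = "First one. Second one.\n\nThird one. Fourth one."
marked = text.replace(".", ".<stop>")
parts = marked.split("<stop>")[:-1]  # drop the empty trailing piece

# Default strip() removes every kind of whitespace, including "\n".
stripped_all = [p.strip() for p in parts]

# strip(" ") removes only spaces, so leading newlines are kept.
stripped_spaces = [p.strip(" ") for p in parts]

print(stripped_all)
print(stripped_spaces)
```

If this reading of the question's code is right, changing `s.strip()` to `s.strip(" ")` there may be enough to keep the `\n` markers.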

Ramesh

I have been able to reproduce your output using this:

txt = 'Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar. \nArrived totally in as between private. Favour of so as on pretty though elinor direct.'

Code:

updated_text = [a if a.endswith('.') else a+'.' for a in txt.split('. ')]

Output:

['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\nArrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']
saad_saeed