I want to split text into sentences while keeping the \n characters. For example, I want to turn this text:
Civility vicinity graceful is it at. Improve up at to on mention perhaps raising. Way building not get formerly her peculiar.
Arrived totally in as between private. Favour of so as on pretty though elinor direct.
into sentences like:
['Civility vicinity graceful is it at.', 'Improve up at to on mention perhaps raising.', 'Way building not get formerly her peculiar.', '\n Arrived totally in as between private.', 'Favour of so as on pretty though elinor direct.']
Right now I'm using this code with re to split the sentences:
import re
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov)"
digits = "([0-9])"

def remove_urls(text):
    text = re.sub(r'http\S+', '', text)
    return text

def split_into_sentences(text):
    print("in")
    print(text)
    text = " " + text + " "
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    if "..." in text: text = text.replace("...",".<prd>")
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    print(sentences)
    return sentences

However, the code removes the \n, which I need. I'm rendering the text with moviepy, which has no built-in function for spacing out text on \n, so I have to write my own. To do that I need the \n to stay in the text as a marker, but splitting into sentences strips it out. What should I do?
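
One likely cause, going by the code above: `str.strip()` called with no argument removes all leading and trailing whitespace, and that includes \n. Passing `" "` makes it remove spaces only. The following is a minimal sketch of that idea, not a drop-in replacement: `split_keep_newlines` is a hypothetical name, and it deliberately leaves out the abbreviation, acronym and website handling from `split_into_sentences`, so it only shows the part that decides whether the \n survives.

```python
import re


def split_keep_newlines(text):
    """Split on ., ! and ? while keeping a leading newline on each sentence.

    Simplified, hypothetical stand-in for split_into_sentences: it has no
    handling for abbreviations such as "Dr." or "Inc.".
    """
    # Mark every sentence end, the same way the original code uses "<stop>".
    marked = re.sub(r"([.!?])", r"\1<stop>", text)
    pieces = marked.split("<stop>")
    # strip(" ") removes only spaces, so a leading "\n" is preserved.
    # A bare strip() would remove the "\n" as well.
    sentences = [piece.strip(" ") for piece in pieces]
    # Drop pieces that contain nothing but whitespace (e.g. the trailing one).
    return [s for s in sentences if s.strip()]


sample = "Way building not get. Her peculiar.\nArrived totally. Favour of so!"
print(split_keep_newlines(sample))
# ['Way building not get.', 'Her peculiar.', '\nArrived totally.', 'Favour of so!']
```

Assuming nothing else in the original function touches the newline, the equivalent change there would be replacing `s.strip()` with `s.strip(" ")` in the list comprehension near the end of `split_into_sentences`.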