You could split the text with a regular expression that uses a negative lookbehind assertion:
import re
# This is the Lorem ipsum text, slightly modified to match your requirements.
# Note the following:
# 1. "et dolore magna" --> the presence of `"`
# 2. Sunt, in culpa, etc. qui ... --> the presence of `etc.`
text = """Lorem ipsum dolor sit amet. Consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore "et dolore magna" aliqua. Ut enim ad minim veniam. Cillum dolore
proident. Sunt, in culpa, etc. qui officia deserunt mollit anim id est laborum."""
# A negative lookbehind assertion splits the text on every period `.`
# that is not preceded by `etc`.
sentences = re.split(r"(?<!etc)\.", text)
# Then keep only the words of each piece, joined by single spaces
# (this drops punctuation and line breaks).
sentences = [" ".join(re.findall(r"\w+", sentence)) for sentence in sentences]
# Finally, print the result.
print(sentences)
Of course, all this is more convenient as a function we can reuse whenever we want.
def get_sentences(text):
    sentences = re.split(r"(?<!etc)\.", text)
    return [" ".join(re.findall(r"\w+", sentence)) for sentence in sentences]
# Example of use.
print(get_sentences(text))
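One thing to watch for: when the text ends with a period, as the sample text does, the split leaves a trailing empty string in the result. A minimal sketch of a variant that filters it out follows; the name get_nonempty_sentences is hypothetical and only used here for illustration.

```python
import re

def get_nonempty_sentences(text):
    """Hypothetical variant of get_sentences that drops empty results."""
    # Split on every period that is not preceded by "etc".
    parts = re.split(r"(?<!etc)\.", text)
    # Keep only the words of each piece, joined by single spaces.
    cleaned = [" ".join(re.findall(r"\w+", part)) for part in parts]
    # A text ending in "." yields a trailing empty piece; filter it out.
    return [sentence for sentence in cleaned if sentence]

print(get_nonempty_sentences("One. Two, etc. three. Four."))
```

Whether to drop the empty entries depends on what you do with the list afterwards, so treat this as an option rather than a required change.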
IMPORTANT
If you find another exception like etc., let's say NLTK., you can add it to the splitter pattern like this:
...
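# Python's re module requires fixed-width lookbehinds, so chain one per
# exception instead of alternating words of different lengths.
sentences = re.split(r"(?<!etc)(?<!NLTK)\.", text)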
...
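If the list of exceptions keeps growing, the pattern can be built from a list instead of edited by hand. The sketch below assumes a hypothetical helper named build_splitter (not part of the original answer) and uses one lookbehind per abbreviation, because the standard re module rejects a single lookbehind whose alternatives have different lengths.

```python
import re

def build_splitter(abbreviations):
    """Compile a pattern matching any period not preceded by a listed abbreviation."""
    # One lookbehind per abbreviation, so each one has its own fixed width.
    # re.escape guards against abbreviations that contain regex metacharacters.
    lookbehinds = "".join(f"(?<!{re.escape(abbr)})" for abbr in abbreviations)
    return re.compile(lookbehinds + r"\.")

splitter = build_splitter(["etc", "NLTK"])
print(splitter.split("Use NLTK. or spaCy, etc. for this. Done"))
```

The compiled pattern can then replace the literal one, for example `splitter.split(text)` in place of the `re.split(...)` call.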
References:
Regular Expression HOWTO
Regular expression operations