Suppose I have a docx file with this text:
When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?". My father sat beside me, hugging my shoulders with both of his arms. I said "I Would.". My father replied "That is my boy!"
I want to segment the text based on direct speech, keeping only the sentences that contain a quotation. Like this:
sent1 : He said, "Son when you grow up would you be the savior of the broken?"
sent2 : I said "I Would."
sent3 : My father replied "That is my boy!"
I tried using a regex, but it splits at the punctuation inside the quotes. The result is this:
When I was a young boy my father took me into the city to see a marching band.
He said, "Son when you grow up would you be the savior of the broken?
".
My father sat beside me, hugging my shoulders with both of his arms.
I said "I Would.
".
My father replied "That is my boy!
The regex code:
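import re

# One or more non-terminator characters followed by a single terminator.
SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')

# Read the contents as a string; findall() needs text, not a file object.
# Note: this only works if the file is really plain text. A true .docx is a
# zip archive and needs a library such as python-docx to extract its text.
with open('text.docx', 'r') as f:
    text = f.read()

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))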
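The pattern `[^!?.]+[!?.]` stops at the first terminator it meets, including one inside a quotation, so the closing quote spills into the next match. For illustration, here is a minimal sketch of one pattern that seems to give the output I am after: an optional lead-in containing no terminator or quote, followed by one complete quoted span. The names `DIRECT_SPEECH_REGEX` and `parse_direct_speech` are made up for this example, and it assumes straight double quotes that are always balanced (Word often substitutes curly quotes, which this would not match) and no abbreviations with periods in the lead-in.

```python
import re

# Lead-in without sentence terminators or quotes, then one quoted span.
# Assumption: straight double quotes only, always opened and closed in pairs.
DIRECT_SPEECH_REGEX = re.compile(r'[^.!?"]*"[^"]*"')

def parse_direct_speech(text):
    """Return each quotation together with the clause that introduces it."""
    return [match.strip() for match in DIRECT_SPEECH_REGEX.findall(text)]

sample = (
    'When I was a young boy my father took me into the city to see a '
    'marching band. He said, "Son when you grow up would you be the savior '
    'of the broken?". My father sat beside me, hugging my shoulders with '
    'both of his arms. I said "I Would.". My father replied "That is my boy!"'
)

for i, sentence in enumerate(parse_direct_speech(sample), start=1):
    print(f"sent{i} : {sentence}")
```

Sentences without a quotation (the first and the third in the sample) are skipped because the pattern requires a quoted span to complete a match.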