Text Segmentation Based On Direct Sentence

Question

Suppose I have a docx file like this:

When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?". My father sat beside me, hugging my shoulders with both of his arms. I said "I Would.". My father replied "That is my boy!"

I want to segment the document based on direct sentences (sentences that contain quoted speech), like this:

sent1 : He said, "Son when you grow up would you be the savior of the broken?"

sent2 : I said "I Would."

sent3 : My father replied "That is my boy!"

I tried using a regex. The result is this:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?

".

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would.

".

My father replied "That is my boy!

Regex code:

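import re

# One sentence = a run of non-terminators followed by ., ? or !
SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    # Assumes text.docx holds plain text; a real .docx is a zip archive.
    with open('text.docx', 'r') as f:
        text = f.read()
    print_sentences(parse_sentences(text))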

`I tried using regex.` With what code? – CertainPerformance Sep 02 '18 at 06:19

score 0 · Answer 1 · answered Sep 02 '18 at 07:48

import re

txt = '''When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"'''

pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')

new = re.sub(pttrn, r'\1\2\n\n', txt)

print(new)

Output:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?"

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would."

My father replied "That is my boy!"

PS: As far as I know, endings like ?"., .". or !". are not standard English punctuation, so the sample text above omits the period after each closing quote.
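If the input keeps the question's original punctuation (a period after the closing quote, as in ?". or .".), the pattern above will not split there, because the quote is followed by a period rather than whitespace. Below is a minimal sketch of one way to tolerate that, assuming the text is a single paragraph of plain prose without abbreviations such as "e.g." or ellipses. The names SENTENCE_END and split_sentences are hypothetical, not part of the answer's code.

```python
import re

# Sentence end: ., ? or !, then an optional closing quote (kept in group 1),
# then an optional stray period (dropped), then whitespace.
SENTENCE_END = re.compile(r'([.?!]["\']?)\.?\s+')


def split_sentences(text):
    """Return the sentences in text, dropping a period that follows a closing quote."""
    # Collapse all whitespace to single spaces so '\n' can serve as a marker.
    normalized = ' '.join(text.split())
    marked = SENTENCE_END.sub(r'\1\n', normalized)
    return marked.split('\n')


sample = 'He said, "Will you?". I said "I Would.". My father replied "That is my boy!"'
for sentence in split_sentences(sample):
    print(sentence)
```

Like the pattern above, this only splits where a terminator is followed by whitespace, so the last sentence is returned exactly as it appears in the input.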

Answered by Zilong Li

and how do I group it based on direct sentence? – Syafiqur Rahman Sep 02 '18 at 08:04
@SyafiqurRahman lst = output.slip("\n\n") – Zilong Li Sep 02 '18 at 08:30
Slip? Did you mean split? 'cause when I tried, no attribute Slip. – Syafiqur Rahman Sep 02 '18 at 08:40
@SyafiqurRahman Yes I do mean split. Sorry for the typo. You can split a string into a list of strings based on any delimiter. – Zilong Li Sep 02 '18 at 23:54
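
To make the split suggestion concrete, here is a minimal sketch that splits the segmented string on blank lines and keeps only the sentences containing a double quote. It assumes direct speech is always marked with straight double quotes; the helper name direct_sentences and the shortened sample string are illustrative only.

```python
def direct_sentences(segmented):
    """Split on blank lines and keep the sentences that contain a double quote."""
    sentences = [s.strip() for s in segmented.split("\n\n")]
    return [s for s in sentences if '"' in s]


# Same shape as the string printed by the answer's code, shortened here.
segmented = (
    'My father sat beside me, hugging my shoulders with both of his arms.\n\n'
    'I said "I Would."\n\n'
    'My father replied "That is my boy!"'
)

for number, sentence in enumerate(direct_sentences(segmented), start=1):
    print(f"sent{number} : {sentence}")
```

Text that uses curly quotes or single quotes for speech would need a different membership check.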
