I'm not sure what the desired output is, but I think you might need some paragraph segmentation before nltk.sent_tokenize, i.e.:
>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
...
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...         print sent
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
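If the input has blank lines that contain stray spaces, or Windows line endings, text.split('\n\n') will miss some paragraph breaks. Below is a minimal sketch of a more forgiving split that uses only the standard re module. It is written in Python 3 syntax, and split_paragraphs is a hypothetical helper name, not part of NLTK:

```python
import re

def split_paragraphs(text):
    """Split text on one or more blank lines (hypothetical helper, not an NLTK function)."""
    # Normalise Windows and old Mac line endings so that only '\n' remains.
    text = text.replace('\r\n', '\n').replace('\r', '\n')
    # A paragraph break is a newline, optional whitespace, then another newline.
    chunks = re.split(r'\n\s*\n', text)
    # Trim surrounding whitespace and drop empty chunks.
    return [chunk.strip() for chunk in chunks if chunk.strip()]

# Made-up sample with a space-only blank line and Windows line endings.
sample = 'First paragraph. Still first.\n \nSecond paragraph.\r\n\r\n\r\nThird.'
print(split_paragraphs(sample))
```

Each returned paragraph can then be passed to sent_tokenize exactly as in the loop above.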
You might also want the strings inside the double quotes. If so, you could try this:
>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']
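The pattern above only matches straight double quotes. News text often uses typographic (curly) quotes instead, so here is a hedged sketch that accepts both. It assumes quotes are never nested, and find_quotes is a hypothetical name chosen for illustration:

```python
import re

# Either a pair of straight quotes, or an opening/closing pair of curly quotes.
# Assumption: quotes are not nested and every opening quote is closed.
QUOTE_RE = re.compile(u'"([^"]*)"|\u201c([^\u201d]*)\u201d')

def find_quotes(text):
    """Return the text inside each pair of double quotes, in order of appearance."""
    # Exactly one of the two groups is set for every match.
    return [m.group(1) if m.group(1) is not None else m.group(2)
            for m in QUOTE_RE.finditer(text)]

# Made-up sample mixing both quote styles.
sample = u'He said: "Hi there." She replied: \u201cBye.\u201d'
print(find_quotes(sample))
```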
Or maybe you would need this:
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...         # Keep track of the quotes with tabs.
...         pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...         for sent in sent_tokenize(_pg):
...             print sent
...         try:
...             print '"{}"'.format(in_quotes.pop(0))
...         except IndexError: # Nothing to pop.
...             pass
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online:
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
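The tab placeholder above breaks if the text itself contains tabs. One possible alternative is re.split with a capturing group, which keeps the quotes in place between the surrounding text. The sketch below is written in Python 3 syntax, and the names segment and naive_sent_tokenize are hypothetical. The regex splitter is only a crude stand-in so that the example runs without NLTK; in practice nltk.sent_tokenize could be passed as the tokenize argument:

```python
import re

# The capturing group makes re.split return the quotes as well as the text between them.
QUOTED = re.compile(r'("[^"]*")')

def naive_sent_tokenize(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def segment(paragraph, tokenize=naive_sent_tokenize):
    """Return sentences outside quotes and whole quotes, in document order."""
    pieces = []
    for i, chunk in enumerate(QUOTED.split(paragraph)):
        if i % 2 == 1:
            pieces.append(chunk)            # odd indices are the captured quotations
        else:
            pieces.extend(tokenize(chunk))  # even indices are ordinary text
    return pieces

# Shortened, made-up variant of the example paragraph.
sample = ('After Du died, her boyfriend posted a message online: '
          '"I lost you. I failed." It was shared widely.')
for piece in segment(sample):
    print(piece)
```

Because the quotation is never handed to the tokenizer, it stays in one piece even though it contains two sentences.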
When reading from a file, try to use the io module:
alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
>>> for sent in sent_tokenize(text):
...     print sent
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
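The io.open call above leaves the file handle open until it is garbage collected. A small sketch of the same read wrapped in a with block, so that the file is closed deterministically. read_text is a hypothetical helper name, and the temporary file only exists to make the example self-contained:

```python
import io
import os
import tempfile

def read_text(path, encoding='utf8'):
    """Read a whole file as unicode text; the file is closed when the with block exits."""
    with io.open(path, 'r', encoding=encoding) as fin:
        return fin.read()

# Demonstration with a throwaway file (made-up content, not the in.txt shown above).
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'in.txt')
content = u'First paragraph.\n\nSecond paragraph with caf\u00e9.'
with io.open(tmp_path, 'w', encoding='utf8') as fout:
    fout.write(content)

print(read_text(tmp_path).split('\n\n'))
```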
And with the paragraph and quote extraction hacks:
>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...         # Keep track of the quotes with tabs.
...         pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...         for sent in sent_tokenize(_pg):
...             print sent
...         try:
...             print '"{}"'.format(in_quotes.pop(0))
...         except IndexError: # Nothing to pop.
...             pass
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online:
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
To concatenate the pre-quote sentence with the quote, the code is almost the same as above. The only change is the trailing comma in "print sent,", which suppresses the newline in Python 2:
>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...         # Keep track of the quotes with tabs.
...         pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...         for sent in sent_tokenize(_pg):
...             print sent,
...         try:
...             print '"{}"'.format(in_quotes.pop(0))
...         except IndexError: # Nothing to pop.
...             pass
...
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
The problem with the above code is that it is limited to sentences like:
After Du died of suffocation, her boyfriend posted a heartbreaking
message online: "Losing consciousness in my arms, your breath and
heartbeat became weaker and weaker. Finally they pushed you out of the
cold emergency room. I failed to protect you."
And cannot handle:
"Losing consciousness in my arms, your breath and heartbeat became
weaker and weaker. Finally they pushed you out of the cold emergency
room. I failed to protect you," her boyfriend posted a heartbreaking
message online after Du died of suffocation.
Just to make sure, my python/nltk versions are:
$ python -c "import nltk; print nltk.__version__"
3.0.3
$ python -V
Python 2.7.6
Beyond the computational aspect of the text processing, there's something subtly different about the grammar of the text in the question.
The fact that a quote is introduced by a colon (:) is atypical of traditional English grammar. This usage might have been popularized in Chinese news because in Chinese:
啊杜窒息死亡后,男友在网上发了令人心碎的消息: "..."
In traditional English, in a very prescriptive grammatical sense, it would have been:
After Du died of suffocation, her boyfriend posted a heartbreaking
message online, "..."
And a post-quotation statement would have been signalled by an ending comma instead of a full stop, e.g.:
"...," her boyfriend posted a heartbreaking message online after Du
died of suffocation.
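These punctuation conventions suggest a simple heuristic that covers both sentence shapes: glue a quote onto the preceding text when that text ends with a colon or comma, and glue the following text onto a quote when the quote ends with a comma. The sketch below is a rough illustration only. It is written in Python 3 syntax and assumes straight double quotes with no nesting. attach_quotes and naive_sent_tokenize are hypothetical names, and the regex splitter is a stand-in so that the example runs without NLTK; nltk.sent_tokenize could be passed as the tokenize argument instead:

```python
import re

# The capturing group makes re.split return the quotes as well as the text between them.
QUOTED = re.compile(r'("[^"]*")')

def naive_sent_tokenize(text):
    """Crude stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def attach_quotes(paragraph, tokenize=naive_sent_tokenize):
    """Return sentences with each quote attached to its introducing or attributing clause."""
    lines = []
    glue_next = False  # True when the next piece belongs on the current line
    for i, chunk in enumerate(QUOTED.split(paragraph)):
        is_quote = (i % 2 == 1)  # odd indices are the captured quotations
        pieces = [chunk] if is_quote else tokenize(chunk)
        for piece in pieces:
            if glue_next and lines:
                lines[-1] += ' ' + piece
            else:
                lines.append(piece)
            if is_quote:
                # '...," he said.' style: the attribution follows the quote.
                glue_next = piece.endswith(',"')
            else:
                # 'He said: "..."' style: the quote follows the introduction.
                glue_next = piece.endswith((':', ','))
    return lines

# Shortened, made-up variants of the two sentence shapes discussed above.
pre = 'After Du died, her boyfriend posted a message online: "I lost you. I failed."'
post = '"I lost you. I failed," her boyfriend posted online after Du died. It was shared widely.'
print(attach_quotes(pre))
print(attach_quotes(post))
```

The heuristic depends entirely on the punctuation conventions described above, so text that ends a quote with a full stop before an attribution would still be split in the wrong place.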