Sentence tokenization for texts that contain quotes

Question

Code:

from pprint import pprint
from nltk.tokenize import sent_tokenize
from unidecode import unidecode
pprint(sent_tokenize(unidecode(text)))

Output:

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.']

Input:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

The closing quotation mark should be part of the previous sentence, i.e. I failed to protect you." rather than "Li Na.

The tokenizer fails at the ." boundary. How can I fix this?

Edit: Explaining the extraction of text.

html = open(path, "r").read()                           # reads the raw HTML
article = extractor.extract(raw_html=html)              # extracts the article content
text = unidecode(article.cleaned_text)                  # transliterates unicode to ASCII

Here, article.cleaned_text is a unicode string. The idea behind using unidecode is to convert characters such as “ to ".

Result with @alvas's solution (still incorrect):

['After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.',
 'Finally they pushed you out of the cold emergency room.',
 'I failed to protect you.',
 '"',
 'Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.'
]

Edit2: (Updated) nltk and python version

python -c "import nltk; print nltk.__version__"
3.0.4
python -V
Python 2.7.9

Natural language parsing is hard. You might try looking at [`nltk.tokenize.punkt.PunktSentenceTokenizer.debug_decisions()`](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktSentenceTokenizer.debug_decisions) to see why the tokenizer acted as it did. — augurar, Aug 14 '15 at 06:20
Looks like the default sentence tokenizer does not recognize quote marks as punctuation. You may be able to specify `"` as a possible sentence boundary character by creating a [`nltk.tokenize.punkt.PunktLanguageVars`](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars) object. — augurar, Aug 14 '15 at 06:25
@Raniz The quote should be included in the previous sentence instead of starting the next one (`" Li.`). — Abhishek Bhatia, Aug 14 '15 at 06:30
@augurar Not sure if that is a good solution, since I am doing this on many docs. I also expect something like `"It is .......", said the company.` — Abhishek Bhatia, Aug 14 '15 at 06:32
@augurar Any suggestions would be greatly helpful. I am kind of stuck here. — Abhishek Bhatia, Aug 14 '15 at 11:05
I'm getting the correct output. Can you show us how you read the text into the "text" variable? — yvespeirsman, Aug 14 '15 at 12:30

score 6 · Answer 1 · edited May 23 '17 at 10:33

I'm not sure what the desired output is, but I think you might need some paragraph segmentation before nltk.sent_tokenize, i.e.:

>>> text = """After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
... 
... Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015."""
>>> from nltk import sent_tokenize
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     for sent in sent_tokenize(pg):
...             print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
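
The paragraph-first approach can also be wrapped in a function that returns a list instead of printing. This is a minimal Python 3 sketch, not the exact code from this answer: tokenize_by_paragraph and naive_sent_split are hypothetical names, and naive_sent_split is only a crude regex stand-in so the example runs without NLTK. In practice you would pass nltk.sent_tokenize as split_sentences.

```python
import re


def naive_sent_split(text):
    """Crude stand-in for nltk.sent_tokenize (an assumption for this sketch).

    Splits on whitespace that follows ., ! or ?, optionally followed by
    a closing double quote.
    """
    parts = re.split(r'(?:(?<=[.!?])|(?<=[.!?]"))\s+', text.strip())
    return [p for p in parts if p]


def tokenize_by_paragraph(text, split_sentences=naive_sent_split):
    """Split text into paragraphs on blank lines, then into sentences."""
    sentences = []
    for paragraph in re.split(r'\n\s*\n', text):
        paragraph = paragraph.strip()
        if paragraph:
            sentences.extend(split_sentences(paragraph))
    return sentences


sample = 'He posted: "I failed. I am sorry."\n\nLi Na was 23.'
for sentence in tokenize_by_paragraph(sample):
    print(sentence)
```

Because each paragraph is tokenized on its own, a closing quote at the end of one paragraph cannot end up on the first sentence of the next paragraph.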

You might also want the strings within the double quotes; if so, you could try this:

>>> import re
>>> str_in_doublequotes = r'"([^"]*)"'
>>> re.findall(str_in_doublequotes, text)
['Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.']
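
The pattern above only matches straight double quotes. If the text has not been normalized (the question mentions curly quotes such as “), a variant along these lines might help. This is a Python 3 sketch under that assumption; QUOTED_SPAN and extract_quotes are hypothetical names.

```python
import re

# Opening quote: straight " or curly U+201C; closing quote: straight " or curly U+201D.
QUOTED_SPAN = re.compile('["\u201c]([^"\u201c\u201d]*)["\u201d]')


def extract_quotes(text):
    """Return the text inside each pair of double quotes, without the quote marks."""
    return QUOTED_SPAN.findall(text)


print(extract_quotes('He said \u201cHi there.\u201d Then she said "Bye."'))
# prints ['Hi there.', 'Bye.']
```

Like the original pattern, this does not handle nested or unbalanced quotes.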

Or maybe you would need this:

>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

When reading from a file, try using the io module:

alvas@ubi:~$ echo -e """After Du died of suffocation, her boyfriend posted a heartbreaking message online: \"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you.\"\n\nLi Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.""" > in.txt
alvas@ubi:~$ cat in.txt 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
alvas@ubi:~$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import io
>>> from nltk import sent_tokenize
>>> text = io.open('in.txt', 'r', encoding='utf8').read()
>>> print text
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

>>> for sent in sent_tokenize(text):
...     print sent
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker.
Finally they pushed you out of the cold emergency room.
I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

And with the paragraph and quote extraction hacks:

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online: 
"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.

For the magic to concatenate the pre-quote sentence with the quotes (don't blink, it looks almost the same as above; the only change is the trailing comma after print sent, which suppresses the newline in Python 2):

>>> import io, re
>>> from nltk import sent_tokenize
>>> str_in_doublequotes = r'"([^"]*)"'
>>> paragraphs = text.split('\n\n')
>>> for pg in paragraphs:
...     # Collects the quotes inside the paragraph 
...     in_quotes = re.findall(str_in_doublequotes, pg)
...     for q in in_quotes:
...             # Keep track of the quotes with tabs.
...             pg = pg.replace('"{}"'.format(q), '\t')
...     for _pg in pg.split('\t'):
...             for sent in sent_tokenize(_pg):
...                     print sent,
...             try:
...                     print '"{}"'.format(in_quotes.pop(0))
...             except IndexError: # Nothing to pop.
...                     pass
... 
After Du died of suffocation, her boyfriend posted a heartbreaking message online:  "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."
Li Na, 23, a migrant worker from a farming family in Jiangxi province, was looking forward to getting married in 2015.
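
The same idea can be written so that it returns a list and attaches each quote only to the last sentence before it (the trailing-comma print puts every sentence of the pre-quote chunk on one line). This is a Python 3 sketch with hypothetical names; naive_sent_split is a crude stand-in for nltk.sent_tokenize so the example runs without NLTK.

```python
import re

QUOTE_RE = re.compile(r'"[^"]*"')


def naive_sent_split(text):
    """Crude stand-in for nltk.sent_tokenize (an assumption for this sketch)."""
    parts = re.split(r'(?:(?<=[.!?])|(?<=[.!?]"))\s+', text.strip())
    return [p for p in parts if p]


def sentences_with_attached_quotes(paragraph, split_sentences=naive_sent_split):
    """Tokenize the text around each quote and glue the quote onto the
    sentence that immediately precedes it."""
    result = []
    position = 0
    for match in QUOTE_RE.finditer(paragraph):
        before = split_sentences(paragraph[position:match.start()])
        if before:
            result.extend(before[:-1])
            result.append(before[-1] + ' ' + match.group(0))
        else:
            # The paragraph (or the remaining text) starts with a quote.
            result.append(match.group(0))
        position = match.end()
    # Whatever follows the last quote is tokenized normally.
    result.extend(split_sentences(paragraph[position:]))
    return result


text = 'It rained. He posted: "I failed. I am sorry." Li Na left.'
for sentence in sentences_with_attached_quotes(text):
    print(sentence)
```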

The problem with the above code is that it is limited to sentences like:

After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."

And cannot handle:

"Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you," her boyfriend posted a heartbreaking message online after Du died of suffocation.
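
One way to cover both orders might be to hide each quote behind a placeholder token before tokenizing and restore it afterwards, so the sentence splitter never sees the full stops inside the quote. This is a Python 3 sketch, not something the code above does: all names are hypothetical, the QUOTETOKEN placeholder is assumed not to occur in the real text, and naive_sent_split is a crude stand-in for nltk.sent_tokenize (how the Punkt tokenizer treats such placeholders is not verified here).

```python
import re

QUOTE_RE = re.compile(r'"[^"]*"')


def naive_sent_split(text):
    """Crude stand-in for nltk.sent_tokenize (an assumption for this sketch)."""
    parts = re.split(r'(?:(?<=[.!?])|(?<=[.!?]"))\s+', text.strip())
    return [p for p in parts if p]


def tokenize_protecting_quotes(text, split_sentences=naive_sent_split):
    """Replace each quote with a placeholder, tokenize, then restore the quotes."""
    quotes = {}

    def protect(match):
        quote = match.group(0)
        # If the quote ends with ., ! or ? keep that character visible after
        # the placeholder so the splitter can still end the sentence there.
        terminal = quote[-2] if quote[-2] in '.!?' else ''
        token = 'QUOTETOKEN{}X{}'.format(len(quotes), terminal)
        quotes[token] = quote
        return token

    masked = QUOTE_RE.sub(protect, text)
    restored = []
    for sentence in split_sentences(masked):
        for token, quote in quotes.items():
            sentence = sentence.replace(token, quote)
        restored.append(sentence)
    return restored


print(tokenize_protecting_quotes('"I failed. I am sorry," he posted. Li Na left.'))
print(tokenize_protecting_quotes('He posted: "I failed. I am sorry." Li Na left.'))
```

A known gap in this heuristic: a quote that ends in ! or ? and is followed by its attribution, such as "Stop!" he shouted, would still be split after the quote by the stand-in splitter.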

Just to make sure, my python/nltk versions are:

$ python -c "import nltk; print nltk.__version__"
'3.0.3'
$ python -V
Python 2.7.6

Beyond the computational aspect of the text processing, there's something subtly different about the grammar of the text in the question.

The fact that a quote is introduced by a colon (:) is atypical of traditional English grammar. This might have been popularized in the Chinese news because in Chinese:

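啊杜窒息死亡后，男友在网上发了令人心碎的消息: "..." (roughly: After Du died of suffocation, her boyfriend posted a heartbreaking message online: "...")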

In traditional English, in a very prescriptive grammatical sense, it would have been:

After Du died of suffocation, her boyfriend posted a heartbreaking message online, "..."

And a post-quotation statement would have been signalled by an ending comma instead of a full stop, e.g.:

"...," her boyfriend posted a heartbreaking message online after Du died of suffocation.
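
The two conventions described above can be told apart with a couple of regular expressions, which may be useful for deciding how to treat each quote. This is a Python 3 sketch under simplifying assumptions (straight double quotes, at most one style per sentence); the pattern names and attribution_style are hypothetical.

```python
import re

# Statement first, then the quote:  ...online: "Quote."  or  ...online, "Quote."
PRE_STATEMENT = re.compile(r'[:,]\s*"[^"]*[.!?]"')
# Quote first, closed by a comma inside the quotes, then the statement:  "Quote," he said.
POST_STATEMENT = re.compile(r'"[^"]*,"\s+\S')


def attribution_style(sentence):
    """Classify where the attributing statement sits relative to the quote."""
    if PRE_STATEMENT.search(sentence):
        return 'pre-statement'
    if POST_STATEMENT.search(sentence):
        return 'post-statement'
    return 'none'


print(attribution_style('Her boyfriend posted a message online: "I failed to protect you."'))
print(attribution_style('"I failed to protect you," her boyfriend posted online.'))
```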

Thanks for the amazing answer! Please check my edit. There still seems to be a problem. — Abhishek Bhatia, Aug 14 '15 at 13:24
Thanks for prompt reply! But I am looking to conserve the quotes and use a regex later exactly as you have mentioned! — Abhishek Bhatia, Aug 14 '15 at 13:29
@AbhishekBhatia, is the final desired output as shown in the current answer? — alvas, Aug 14 '15 at 13:32
Amazing! But there is still one problem, please check edit2 above. — Abhishek Bhatia, Aug 14 '15 at 13:40
One other point I forgot: This is one sentence: `After Du died of suffocation, her boyfriend posted a heartbreaking message online: "Losing consciousness in my arms, your breath and heartbeat became weaker and weaker. Finally they pushed you out of the cold emergency room. I failed to protect you."` — Abhishek Bhatia, Aug 14 '15 at 13:47
It really depends on the task. I would keep that as two sentences, but give me a minute; it's possible to concatenate them. — alvas, Aug 14 '15 at 13:49
Great observation! I updated nltk and the blank line went away. — Abhishek Bhatia, Aug 14 '15 at 14:00
Amazing insight with regard to the grammatical context. The problem I have at hand is that I do this for numerous news articles (~1000), and this is one of them. I am looking for a generalized solution, but I don't want to miss cases like these either. — Abhishek Bhatia, Aug 14 '15 at 14:04
I have written something similar, but I had to extract quotations from Harry Potter and Sherlock Holmes, so the patterns were standardized. I remember that I first had to grab all quotations and then look at the sentences before and after them to see the patterns, then use some regex to clean them up into the output I required. Depending on the text, it might be 30 minutes to 1 hour of work, but it's worth it since it doesn't propagate errors to the tasks later in the pipeline (e.g. tagging, classification). — alvas, Aug 14 '15 at 14:07
My suggestion is to first sort out quotes with pre/post-statements and separate them. Then sort out quotes with both pre- and post-statements. Then use 3-4 different patterns for the different types of quotes. That should clean up and extract the sentences perfectly. — alvas, Aug 14 '15 at 14:09
