The input list of sentences:
sentences = [
"""Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
"""Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]
The desired output:
How Doth the Little Busy Bee,
I'll try again.
Is there a way to extract the quoted passages (they can appear in either single or double quotes) with nltk, using built-in or third-party tokenizers?
I've tried the SExprTokenizer, passing the single and double quotes as the parens value, but the result was far from what I wanted, e.g.:
In [1]: from nltk import SExprTokenizer
...:
...:
...: sentences = [
...: """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
...: """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
...: ]
...:
...: tokenizer = SExprTokenizer(parens='""', strict=False)
...: for sentence in sentences:
...: for item in tokenizer.tokenize(sentence):
...: print(item)
...: print("----")
...:
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'
There were similar threads like this and this, but all of them suggest a regex-based approach. I'm curious whether this can be solved with nltk alone - it sounds like a common task in Natural Language Processing.
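For reference, the regex-based approach from those threads can be sketched roughly as follows (the exact pattern is my own illustration, not taken from the linked answers); the lookaround assertions are there so that apostrophes inside words like "I've" are not mistaken for opening quotes:

```python
import re

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]

# "([^"]+)"           : anything but a double quote, between double quotes
# (?<!\w)'(.+?)'(?!\w): lazy match between single quotes, where the opening
#                       quote must not follow a word character and the closing
#                       quote must not precede one -- this skips apostrophes
#                       as in "I've" and tolerates the one inside "I'll"
pattern = re.compile(r'"([^"]+)"|(?<!\w)\'(.+?)\'(?!\w)')

quotes = [m.group(1) or m.group(2)
          for sentence in sentences
          for m in pattern.finditer(sentence)]
print(quotes)
# -> ["How Doth the Little Busy Bee,", "I'll try again."]
```

This produces the desired output for the two sample sentences, but being a regex, it is exactly the kind of brittle solution I was hoping nltk could replace.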