The input list of sentences:
sentences = [
"""Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
"""Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]
The desired output:
How Doth the Little Busy Bee,
I'll try again.
Is there a way to extract the quoted passages (they can appear in either single or double quotes) with nltk, using built-in or third-party tokenizers?
I've tried the SExprTokenizer, passing the single and double quotes as the parens value, but the result was far from what I wanted, e.g.:
In [1]: from nltk import SExprTokenizer
...:
...:
...: sentences = [
...: """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
...: """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
...: ]
...:
...: tokenizer = SExprTokenizer(parens='""', strict=False)
...: for sentence in sentences:
...: for item in tokenizer.tokenize(sentence):
...: print(item)
...: print("----")
...:
Well,
I've
tried
to
say
"
How
Doth
the
Little
Busy
Bee,
"
but it all came different!
----
Alice replied in a very melancholy voice. She continued, 'I'll try again.'
There were similar threads like this and this, but all of them suggest a regex-based approach. I'm curious whether this can be solved with nltk alone - it sounds like a common task in Natural Language Processing.
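For reference, the regex-based approach from those threads can be sketched roughly as follows (the exact pattern is my own illustration, not taken from the linked answers); the lookaround assertions are there so that apostrophes inside words like "I've" are not mistaken for opening quotes:

```python
import re

sentences = [
    """Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!""",
    """Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""
]

# "([^"]+)"           : anything but a double quote, between double quotes
# (?<!\w)'(.+?)'(?!\w): lazy match between single quotes, where the opening
#                       quote must not follow a word character and the closing
#                       quote must not precede one -- this skips apostrophes
#                       as in "I've" and tolerates the one inside "I'll"
pattern = re.compile(r'"([^"]+)"|(?<!\w)\'(.+?)\'(?!\w)')

quotes = [m.group(1) or m.group(2)
          for sentence in sentences
          for m in pattern.finditer(sentence)]
print(quotes)
# -> ["How Doth the Little Busy Bee,", "I'll try again."]
```

This produces the desired output for the two sample sentences, but being a regex, it is exactly the kind of brittle solution I was hoping nltk could replace.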