NLTK (http://www.nltk.org/) is a toolkit for computational linguistics.
I am trying to manipulate sentences using the sents() method:
from nltk.corpus import gutenberg
It fetches texts by fileid:
hamlet = gutenberg.sents('shakespeare-hamlet.txt')
Printing it gives:
print(hamlet)
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]
But let's say I want to make a list of sentences by author instead of by book.
I tried it the repetitive way, with append() (it won't let me extend() the lists):
shakespeare = []
hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')
shakespeare.append(hamlet)
shakespeare.append(macbeth)
shakespeare.append(caesar)
But then everything ends up nested, one list per book:
print(shakespeare)
[[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]]
Is there a way I can end up with ONE list with all concatenated sentences, not nested, like this?
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]
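For reference, here is a minimal sketch of the flat result I am after. It uses small made-up lists in place of the real corpus so that it runs without downloading anything. The stand-in data and the name shakespeare_chained are only for illustration. It relies on the fact that list.extend() accepts any iterable and adds its items one by one, whereas append() adds the whole object as a single item. I have not confirmed how the objects returned by sents() behave here, so treat this as a sketch rather than a confirmed fix.

```python
from itertools import chain

# Stand-in data (made up): each "book" is a list of tokenized sentences,
# shaped like what gutenberg.sents(fileid) prints above.
hamlet = [['[', 'The', 'Tragedie', 'of', 'Hamlet', ']'], ['Actus', 'Primus', '.']]
macbeth = [['[', 'The', 'Tragedie', 'of', 'Macbeth', ']'], ['Actus', 'Primus', '.']]
caesar = [['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', ']'], ['Actus', 'Primus', '.']]

# Option 1: extend() a plain list, which adds each sentence of the book
# rather than the book itself.
shakespeare = []
for book in (hamlet, macbeth, caesar):
    shakespeare.extend(book)

# Option 2: chain the books together and build the list once.
shakespeare_chained = list(chain(hamlet, macbeth, caesar))

print(len(shakespeare))
print(shakespeare[2])
```

Both options give one list whose items are sentences, with no extra level per book. As far as I can tell from the NLTK documentation, the corpus reader methods are also supposed to accept a list of fileids (for example gutenberg.sents(['shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'shakespeare-caesar.txt'])), but I have not verified that here.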