3

NLTK (http://www.nltk.org/) is a toolkit for computational linguistics.

I am trying to manipulate sentences, using the sents() method:

from nltk.corpus import gutenberg

It fetches texts by fileid:

hamlet = gutenberg.sents('shakespeare-hamlet.txt')

The output is:

print(hamlet)
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]

But let's say I want a list of sentences by author instead of by book. Doing it repetitively (it won't let me extend() the lists):

shakespeare = []

hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')

shakespeare.append(hamlet)
shakespeare.append(macbeth)
shakespeare.append(caesar)

But then it all becomes nested:

print(shakespeare)

[[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]]

Is there a way I can end up with ONE list with all concatenated sentences, not nested, like this?

[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]
Daniel
8-Bit Borges
  • Your example at the bottom isn't a valid list. It would help if you gave a little more info about what you are looking to achieve (and/or double-checking your example). – Daniel Jun 08 '16 at 03:41
  • @Daniel there you go, I've edited the bottom example. Thank you for alerting me. The syntax of the examples above is exactly right. – 8-Bit Borges Jun 08 '16 at 03:58
  • My pleasure - although, it's still not quite there (your example as it is written throws a `SyntaxError`). I think I see what you are going for: it looks like you want this `['[', 'The', 'Tragedie', 'of', 'Hamlet',` for the 1st element instead of this `[', 'The', 'Tragedie', 'of', 'Hamlet',` but that's just a guess. Just saw your most recent edit, though, which makes it a lot more clear - thanks! – Daniel Jun 08 '16 at 04:16
  • Have you had a chance to look at http://stackoverflow.com/questions/16176742/python-3-replacement-for-deprecated-compiler-ast-flatten-function ? – Jerzyk Jun 08 '16 at 04:54

3 Answers

2

The best solution is to fetch them all at once; the sentences then come back the way you want them. NLTK's corpus readers accept either a single filename or a list of files:

shakespeare = gutenberg.sents(['shakespeare-hamlet.txt',
                 'shakespeare-macbeth.txt', 'shakespeare-caesar.txt'])
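
Since the goal is to group "by author", the list of files could also be built rather than typed out. The following is only a sketch: `fileids_by_author` is a hypothetical helper (not part of NLTK), and it assumes the fileids follow the `author-title.txt` naming seen in the examples above. The sample fileids are hard-coded so the snippet runs without the corpus installed.

```python
def fileids_by_author(fileids, author):
    """Return the fileids that start with '<author>-' (hypothetical helper)."""
    prefix = author + '-'
    return [f for f in fileids if f.startswith(prefix)]


# Hard-coded sample standing in for gutenberg.fileids()
sample_fileids = [
    'austen-emma.txt',
    'shakespeare-caesar.txt',
    'shakespeare-hamlet.txt',
    'shakespeare-macbeth.txt',
]

shakespeare_files = fileids_by_author(sample_fileids, 'shakespeare')
print(shakespeare_files)

# With NLTK and the Gutenberg corpus available, the same helper could be used as:
#   from nltk.corpus import gutenberg
#   shakespeare = gutenberg.sents(fileids_by_author(gutenberg.fileids(), 'shakespeare'))
```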

In other situations, if you have several lists and you want to concatenate them, you should use extend(), not append():

shakespeare = []
shakespeare.extend(hamlet)
shakespeare.extend(macbeth)
shakespeare.extend(caesar)
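
To make the difference concrete, here is a minimal sketch using small plain lists as stand-ins for the corpus output (the sentences are invented sample data, not real NLTK output):

```python
# Stand-in data: each "play" is a list of tokenized sentences,
# mimicking the shape of what gutenberg.sents() gives back.
hamlet = [['Actus', 'Primus', '.'], ['Who', "'s", 'there', '?']]
macbeth = [['Thunder', 'and', 'Lightning', '.']]

nested = []
nested.append(hamlet)   # adds the whole play as ONE element
nested.append(macbeth)

flat = []
flat.extend(hamlet)     # adds each sentence of the play individually
flat.extend(macbeth)

print(len(nested))  # 2 (one entry per play)
print(len(flat))    # 3 (one entry per sentence)
```

append() is what produces the extra level of nesting shown in the question; extend() keeps the result a flat list of sentences.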
alexis
1

I agree with alexis that the ideal is to fetch them all at once from the gutenberg corpus. For anyone in the future looking to concatenate sentences from separate corpora, you could also try this Pythonic approach:

hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')

shakespeare = hamlet + macbeth + caesar
0

You can use itertools.chain after appending to your list shakespeare:

from itertools import chain

lis = list(chain.from_iterable(shakespeare))

# output (abbreviated):
# [
#   ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'],
#   ['Actus', 'Primus', '.'],
#   ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
#   ['Actus', 'Primus', '.'],
#   ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'],
#   ['Actus', 'Primus', '.']
# ]

You could also opt for a list comprehension with a double-loop:

lis = [y for x in shakespeare for y in x]
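
As a quick check that the two approaches agree, here is a small sketch with stand-in data in place of the real corpus (the nested list below is invented for illustration):

```python
from itertools import chain

# Stand-in for the nested list built with append(): plays -> sentences -> tokens
shakespeare = [
    [['Actus', 'Primus', '.'], ['Scoena', 'Prima', '.']],  # first play
    [['Thunder', 'and', 'Lightning', '.']],                # second play
]

flat_chain = list(chain.from_iterable(shakespeare))
flat_comp = [sentence for play in shakespeare for sentence in play]

print(flat_chain == flat_comp)  # True
print(flat_chain[0])            # ['Actus', 'Primus', '.']
```

Both remove exactly one level of nesting, so each sentence stays a list of tokens.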
Daniel