There is a way to "unflatten" a list in Python (see, for example, HERE). However, how can this be done efficiently when the split points are given by specific elements? Here is a slightly altered beginning of Jane Austen's "Pride and Prejudice":
Austen = """ONE: It is a truth universally acknowledged, ONE: that a single man in possession
of a good fortune, must be in want of a wife.
TWO: However little known the feelings or views of such a man may be on his
first entering a neighbourhood, ONE: this truth is so well fixed in the minds
of the surrounding families, THREE: that he is considered as the rightful
property of some one or other of their daughters.
TWO: "My dear Mr. Bennet," said his lady to him one day, ONE: "have you heard that
Netherfield Park is let at last?"
"""
Note that some break-point markers have been added to the text:
BREAK_POINTS = ('ONE:', 'TWO:', 'THREE:')
Using RegexpTokenizer from nltk, it is quite easy to get the list of word tokens:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokens = []
for line in Austen.splitlines():
    if line == '':
        continue
    tokens += tokenizer.tokenize(line)
tokens
['ONE:', 'It', 'is', 'a', 'truth', 'universally',
'acknowledged,', 'ONE:', 'that', 'a', 'single', 'man', 'in',
'possession', 'of', 'a', 'good', 'fortune,', 'must', 'be', 'in',
'want', 'of', 'a', 'wife.', 'TWO:', 'However', 'little', 'known',
'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may',
'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood,',
'ONE:', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the',
'minds', 'of', 'the', 'surrounding', 'families,', 'THREE:', 'that',
'he', 'is', 'considered', 'as', 'the', 'rightful', 'property', 'of',
'some', 'one', 'or', 'other', 'of', 'their', 'daughters.', 'TWO:',
'"My', 'dear', 'Mr.', 'Bennet,"', 'said', 'his', 'lady', 'to',
'him', 'one', 'day,', 'ONE:', '"have', 'you', 'heard', 'that',
'Netherfield', 'Park', 'is', 'let', 'at', 'last?"']
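Since the pattern above only splits on runs of whitespace, the same kind of token list can presumably be produced without nltk. A minimal sketch, using a shortened sample string (the names sample and sample_tokens are hypothetical and not part of the code above):

```python
# Shortened, hypothetical sample in the same style as the Austen text.
sample = """ONE: It is a truth,
ONE: that a single man

TWO: However little known
"""

# str.split() with no arguments splits on any run of whitespace
# (spaces and newlines alike) and never yields empty strings,
# so blank lines need no special handling.
sample_tokens = sample.split()
print(sample_tokens)
```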
How can I "unflatten" that list using BREAK_POINTS? In particular, if a break point repeats before a different one appears, like the first two 'ONE:', the repeat should be ignored. The desired result is:
[['ONE:', 'It', 'is', 'a', 'truth', 'universally', 'acknowledged,',
'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good',
'fortune,', 'must', 'be', 'in', 'want', 'of', 'a', 'wife.'],
['TWO:', 'However', 'little', 'known', 'the', 'feelings', 'or',
'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his',
'first', 'entering', 'a', 'neighbourhood,'], ['ONE:', 'this',
'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of',
'the', 'surrounding', 'families,'], ['THREE:', 'that', 'he', 'is',
'considered', 'as', 'the', 'rightful', 'property', 'of', 'some',
'one', 'or', 'other', 'of', 'their', 'daughters.'], ['TWO:', '"My',
'dear', 'Mr.', 'Bennet,"', 'said', 'his', 'lady', 'to', 'him',
'one', 'day,'], ['ONE:', '"have', 'you', 'heard', 'that',
'Netherfield', 'Park', 'is', 'let', 'at', 'last?"']]
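One possible single-pass approach, given only as a minimal sketch: start a new sublist whenever a break point differs from the one that opened the current sublist, and drop a break point that repeats. The function name unflatten is hypothetical, and the handling of tokens that appear before the first break point (they get a sublist of their own) is an assumption, since the question does not cover that case.

```python
def unflatten(tokens, break_points):
    """Split tokens into sublists, each opened by a break point."""
    break_points = set(break_points)  # O(1) membership tests
    groups = []
    current_marker = None  # break point that opened the current sublist

    for token in tokens:
        if token in break_points:
            if token == current_marker:
                # Same break point repeated inside its own sublist: ignore it.
                continue
            current_marker = token
            groups.append([token])
        elif groups:
            groups[-1].append(token)
        else:
            # Assumption: tokens before any break point form a leading sublist.
            groups.append([token])
    return groups


# Small illustrative input rather than the full token list.
demo = ['ONE:', 'It', 'is', 'ONE:', 'that', 'TWO:', 'However', 'ONE:', 'this']
result = unflatten(demo, ('ONE:', 'TWO:', 'THREE:'))
print(result)
# [['ONE:', 'It', 'is', 'that'], ['TWO:', 'However'], ['ONE:', 'this']]
```

Each token is visited once, so the cost is linear in the length of the list. A call such as unflatten(tokens, BREAK_POINTS) on the full token list should follow the same rule.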