For example we have following text:
"Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. ..."
I need all possible section of this text respectively, for one word by one word, then two by two, three by three to five to five. like this:
ones : ['Spark', 'is', 'a', 'framework', 'for', 'writing, 'fast', 'distributed', 'programs', ...]
twos : ['Spark is', 'is a', 'a framework', 'framework for', 'for writing' ...]
threes : ['Spark is a', 'is a framework', 'a framework for', 'framework for writing', 'for writing fast', ...]
. . .
fives : ['Spark is a framework for', 'is a framework for writing', 'a framework for writing fast','framework for writing fast distributed', ...]
Please note that the text to be processed is huge text( about 100GB). I need the best solution for this process. May be it should be processed multi thread in parallel.
I don't need whole list at once, it can be streaming.