When I load a text file into an RDD, it is split by line by default. For example, consider the following text:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum
has been the industry's standard dummy text ever since the 1500s. When an
unknown printer took a galley of type and scrambled it to make a type specimen book
and publish it.
If I load it into an RDD as follows, the data is split line by line:
>>> RDD = sc.textFile("Dummy.txt")
>>> RDD.count()
4
>>> RDD.collect()
['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum ',
"has been the industry's standard dummy text ever since the 1500s. When an ",
'unknown printer took a galley of type and scrambled it to make a type specimen book',
'and publish it.']
Since there are 4 lines in the text file, RDD.count() gives 4 as output. Similarly, the list returned by RDD.collect() contains 4 strings. Is there a way to load the file so that it is parallelized by sentence rather than by line? In that case the output should be as follows:
>>> RDD.count()
3
>>> RDD.collect()
['Lorem Ipsum is simply dummy text of the printing and typesetting industry.',
"Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.",
'When an unknown printer took a galley of type and scrambled it to make a type specimen book and publish it.']
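To make the expected result concrete, here is a minimal plain-Python sketch of the split I am after. It does not use Spark at all. The names `TEXT` and `split_sentences` are just illustrative ones I made up, and it assumes that every sentence ends with a full stop followed by whitespace.

```python
import re

# The same four lines as in Dummy.txt, including the trailing spaces.
TEXT = (
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum \n"
    "has been the industry's standard dummy text ever since the 1500s. When an \n"
    "unknown printer took a galley of type and scrambled it to make a type specimen book\n"
    "and publish it.\n"
)


def split_sentences(text):
    """Split text at full stops, ignoring where the lines end (hypothetical helper)."""
    # Collapse newlines and repeated spaces so line breaks no longer matter.
    flat = " ".join(text.split())
    # Split at whitespace that directly follows a full stop; the stop is kept.
    parts = re.split(r"(?<=\.)\s+", flat)
    return [p for p in parts if p]


sentences = split_sentences(TEXT)
print(len(sentences))
for sentence in sentences:
    print(repr(sentence))
```

This naive rule would also break on abbreviations such as "e.g.", which does not matter for this example. What I am looking for is the same record boundary applied by Spark while loading the file.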
Can I pass some argument to sc.textFile so that my data is split whenever a full stop appears, rather than when a line in the text file ends?