
When I load a text file into an RDD, it is split by line by default. For example, consider the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum 
has been the industry's standard dummy text ever since the 1500s. When an 
unknown printer took a galley of type and scrambled it to make a type specimen book
and publish it.

If I load it into an RDD as follows, the data is split by line:

>>> RDD = sc.textFile("Dummy.txt")
>>> RDD.count()
    4
>>> RDD.collect()
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum ',
    "has been the industry's standard dummy text ever since the 1500s. When an ",
    'unknown printer took a galley of type and scrambled it to make a type specimen book',
    'and publish it.']

Since there are 4 lines in the text file, RDD.count() gives 4, and the list returned by RDD.collect() contains 4 strings. But is there a way to load the file so that it is parallelized by sentence rather than by line? In that case the output should be as follows:

>>> RDD.count()
    3
>>> RDD.collect()
    ['Lorem Ipsum is simply dummy text of the printing and typesetting industry.',
    "Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.",
    'When an unknown printer took a galley of type and scrambled it to make a type specimen book and publish it.']

Can I pass some argument to sc.textFile so that the data is split whenever a full stop appears, rather than when a line in the text file ends?

mango

3 Answers


The RDD textFile method internally uses Hadoop's TextInputFormat to read text files. The default key/value pair is the record offset and the entire record, with '\n' as the default record delimiter. The easy way around this is to read the file with the DataFrame csv method, specifying "." as the delimiter, as below:

spark.read.option("delimiter", ".").csv("path to your file")

The catch here is that it splits the sentences into columns rather than rows, which might not be feasible for hundreds of sentences.
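
To make that concrete, here is a minimal PySpark sketch (not part of the original answer) of the DataFrame route and one way to pivot the resulting columns back into rows; the file name Dummy.txt comes from the question, and the explode/trim cleanup is an illustrative assumption. Note also that the csv reader still breaks records at newlines, so sentences spanning lines are not stitched back together:

    # Minimal PySpark sketch of the columns-vs-rows catch; "Dummy.txt" is the
    # question's file, the explode/trim cleanup is illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import array, col, explode, trim

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.option("delimiter", ".").csv("Dummy.txt")
    df.show(truncate=False)  # each input line is one row, each "." fragment a column

    # Turn the columns back into rows: explode the fragments and drop nulls/blanks.
    sentences = (df
                 .select(explode(array(*[col(c) for c in df.columns])).alias("sentence"))
                 .where(col("sentence").isNotNull() & (trim(col("sentence")) != ""))
                 .select(trim(col("sentence")).alias("sentence")))
    sentences.show(truncate=False)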

The other way around this is to change the default delimiter of Hadoop's text input format from '\n' to '.'.

This can be done like this:

 val conf = new org.apache.hadoop.conf.Configuration
 // "\u002E" is the Unicode escape for "."
 conf.set("textinputformat.record.delimiter", "\u002E")
 sc.newAPIHadoopFile("<file-path>",
     classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
     classOf[org.apache.hadoop.io.LongWritable],
     classOf[org.apache.hadoop.io.Text],
     conf).count()

Alternatively, you can also write your own custom InputFormat and use the newAPIHadoopFile or hadoopFile methods above to read in the files.

Mohd Avais

I found my answer in one of the answers here, written by singer. It goes as follows:

rdd = sc.newAPIHadoopFile(YOUR_FILE, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": YOUR_DELIMITER}).map(lambda l:l[1])
mango

In Scala, we can do collect() + mkString to create a single string and then split it on ".":

Example:

spark.sparkContext.parallelize(spark.sparkContext.textFile("<file_path>").collect().mkString.split("\\.")).count()

//3

spark.sparkContext.parallelize(spark.sparkContext.textFile("<file_path>").collect().mkString.split("\\.")).toDF().show(false)

//+----------------------------------------------------------------------------------------------------------+
//|_1                                                                                                        |
//+----------------------------------------------------------------------------------------------------------+
//|Lorem Ipsum is simply dummy text of the printing and typesetting industry                                 |
//| Lorem Ipsum has been the industry's standard dummy text ever since the 1500s                             |
//| When an unknown printer took a galley of type and scrambled it to make a type specimen bookand publish it|
//+----------------------------------------------------------------------------------------------------------+
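
For comparison, a rough PySpark equivalent of the same collect-then-split idea (a sketch of mine, not part of this answer); like the Scala version it pulls the whole file to the driver first, so it is only sensible for small files:

    # Rough PySpark sketch of the same collect-then-split idea; "Dummy.txt" is
    # illustrative, and everything is collected to the driver, so small files only.
    lines = spark.sparkContext.textFile("Dummy.txt").collect()
    sentences = [s.strip() for s in "".join(lines).split(".") if s.strip()]
    rdd = spark.sparkContext.parallelize(sentences)
    print(rdd.count())  # 3 for the question's sample text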
notNull