Record delimiter in PySpark SparkContext

Question

I would like to change the record delimiter from the newline "\n" to "\u0001" in PySpark. How can I do that? When I run the following, it still splits on the newline "\n". Thanks!

from pyspark import SparkContext, SparkConf

# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')
conf.set("textinputformat.record.delimiter", "\u0002")

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

rdd = sc.textFile("MY_PATH")

https://stackoverflow.com/questions/60643825/pyspark-how-to-input-a-text-file-such-that-it-is-split-by-fullstop does this help? — cs95, Apr 16 '23 at 18:57

score 1 · Answer 1 · answered Apr 16 '23 at 19:32

I found the following answer that worked for me:

from pyspark import SparkContext, SparkConf

path = "MY_PATH"  # placeholder: replace with the actual input path
# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

input_format_class = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
# TextInputFormat yields (byte offset, record) pairs, so the key is a
# LongWritable and the value is a Text
key_class = "org.apache.hadoop.io.LongWritable"
value_class = "org.apache.hadoop.io.Text"

# pass the delimiter as a Hadoop configuration option for this read only
rconf = {"textinputformat.record.delimiter": "\u0002"}
rdd = sc.newAPIHadoopFile(path,
                          input_format_class,
                          key_class,
                          value_class,
                          conf=rconf)
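
Note that newAPIHadoopFile returns an RDD of (key, value) pairs rather than plain strings, so if you only want the records themselves you still need to drop the keys. A minimal sketch of how that could be wrapped up, assuming an already-created SparkContext is passed in as sc; read_records and the constant names are hypothetical, made up for illustration and not part of PySpark:

```python
TEXT_INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"  # byte offset of each record
VALUE_CLASS = "org.apache.hadoop.io.Text"        # the record text itself


def read_records(sc, path, delimiter="\u0002"):
    """Hypothetical helper: return an RDD of records split on `delimiter`."""
    pairs = sc.newAPIHadoopFile(
        path,
        TEXT_INPUT_FORMAT,
        KEY_CLASS,
        VALUE_CLASS,
        conf={"textinputformat.record.delimiter": delimiter},
    )
    # drop the byte-offset keys and keep only the record text
    return pairs.values()
```

With that in place, read_records(sc, path).take(5) should give the first few records as plain strings, and a different delimiter such as "\u0001" can be passed as the third argument.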
