
I would like to change the record (line) delimiter to "\u0001" in PySpark. How can I do that? With the code below, Spark still splits records on the newline "\n". Thanks!

from pyspark import SparkContext, SparkConf

# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')
conf.set("textinputformat.record.delimiter", "\u0002")

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

rdd = sc.textFile(f"MY_PATH")
dotan

1 Answer


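I found the following answer, which worked for me. It passes the delimiter through the Hadoop configuration of sc.newAPIHadoopFile instead of setting it on the SparkConf: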

from pyspark import SparkContext, SparkConf

path = "MY_PATH"  # placeholder: path to the input file
# create a SparkConf object with some configuration options
conf = SparkConf().setAppName('example').setMaster('local[*]')

# create a SparkContext object with the SparkConf object
sc = SparkContext(conf=conf)

input_format_class = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
# TextInputFormat yields (byte offset, record text) pairs
key_class = "org.apache.hadoop.io.LongWritable"
value_class = "org.apache.hadoop.io.Text"

# pass the record delimiter through the Hadoop job configuration
rconf = {"textinputformat.record.delimiter": "\u0002"}
rdd = sc.newAPIHadoopFile(path,
                          input_format_class,
                          key_class,
                          value_class,
                          conf=rconf)
# keep only the record text, dropping the byte-offset key
rdd = rdd.map(lambda kv: kv[1])
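
For reuse, the same idea can be wrapped in a small helper. This is only a sketch: the names hadoop_conf, read_records and expected_records are hypothetical and not part of PySpark. It assumes an existing SparkContext is passed in as sc, and it uses LongWritable keys and Text values, since TextInputFormat produces (byte offset, record text) pairs. expected_records is a plain-Python approximation of the splitting, assuming a trailing delimiter does not produce an extra empty record; it is only meant for checking small samples without Spark.

```python
TEXT_INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"


def hadoop_conf(delimiter):
    """Build the Hadoop job configuration carrying the record delimiter."""
    return {"textinputformat.record.delimiter": delimiter}


def read_records(sc, path, delimiter):
    """Return an RDD of records from `path`, split on `delimiter`.

    `sc` is an existing SparkContext (assumed to be created by the caller).
    """
    pairs = sc.newAPIHadoopFile(
        path,
        TEXT_INPUT_FORMAT,
        "org.apache.hadoop.io.LongWritable",  # key: byte offset of the record
        "org.apache.hadoop.io.Text",          # value: the record text
        conf=hadoop_conf(delimiter),
    )
    # Drop the byte-offset key and keep only the record text.
    return pairs.map(lambda kv: kv[1])


def expected_records(text, delimiter):
    """Plain-Python approximation of the split, for small local checks.

    Assumption: a delimiter at the very end of the input does not
    produce an extra empty record.
    """
    parts = text.split(delimiter)
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

With the SparkContext from above, read_records(sc, path, "\u0002").collect() should then return the records as strings. Any "\n" inside a record stays part of that record, because the newline is no longer treated as a delimiter.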

dotan