
In the Spark shell, I'm reading an input file, trimming the field values, and then saving the final RDD with the saveAsTextFile() method. The field separator in the input file is '|', but in the output file the field separator is ','.

Input Format: abc | def | xyz

Default Output Format: abc,def,xyz

Required output format something like abc|def|xyz

Is there any way to change the default output delimiter to '|'? If yes, please suggest how.

subodh
VSP
  • Possible duplicate of [remove parentheses from output in spark](http://stackoverflow.com/questions/29945330/remove-parentheses-from-output-in-spark) – Utkarsh Oct 24 '16 at 06:33

1 Answer


For an RDD, you just need to build a pipe-separated string from each tuple's product iterator:

scala> val rdd = sc.parallelize(Seq(("a", 1, 3), ("b", 2, 10)))
// rdd: org.apache.spark.rdd.RDD[(String, Int, Int)] = ParallelCollectionRDD[11] at parallelize at <console>:27

scala> rdd.map { x => x.productIterator.toSeq.mkString("|") }.collect
// res9: Array[String] = Array(a|1|3, b|2|10)

scala> rdd.map { x => x.productIterator.toSeq.mkString("|") }.saveAsTextFile("test")

Now let's check the content of the output files:

$ cat test/part-0000*
a|1|3
b|2|10
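For reference, the core of the technique needs no Spark at all: `productIterator` walks a tuple's fields and `mkString` joins them with any separator. A minimal plain-Scala sketch (the `toPipeString` helper name is mine, for illustration):

```scala
// Join a tuple's fields with '|' — the same trick used in the Spark map above.
// Product is the common supertype of all tuples, so this works for any arity.
object PipeJoin {
  def toPipeString(p: Product): String =
    p.productIterator.mkString("|") // mkString works directly on the iterator

  def main(args: Array[String]): Unit = {
    println(toPipeString(("a", 1, 3)))
    println(toPipeString(("b", 2, 10)))
  }
}
```

Note that `.toSeq` in the answer above is optional, since `mkString` is available on the iterator itself.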
eliasah