
Passing a variable as the delimiter to dataframe.write.csv is not working, and trying out alternatives is turning out to be too complicated.

 val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
 val delim_char = "\u001F"

 df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")  // Does not work -- error related to too many chars
 df.coalesce(1).write.option("delimiter", "\u001F").csv("file:///var/tmp/test")  //works fine...

I have tried .toHexString and many other alternatives...
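
For reference, a minimal way to inspect what delim_char actually holds (a quick sketch, assuming the same spark-shell session):

 println(delim_char.length)                      // 1 if it is the real \u001F character
 println(delim_char.map(_.toInt).mkString(","))  // prints "31" (0x1F) if it is the unit separator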

Terry
  • It works both when you give the string value directly and when you pass a variable that holds it. You get the too-many-characters error if you use single quotes in the declaration, '\u001F', but with the double-quoted declaration "\u001F" above you should not face any issues. – Mansoor Baba Shaik Aug 24 '18 at 03:28
  • @Mansoor, as stated, it does not work for me in Scala 2.11.8... Any help would be greatly appreciated. – Terry Aug 24 '18 at 03:49

2 Answers


Your declaration works fine. It works both when you give the string value directly and when you pass a reference variable. You will get the character-length error only if you enclose the delimiter value in single quotes, '\u001F'. It has nothing to do with Scala 2.11.8.

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://xx.x.xxx.xx:xxxx
Spark context available as 'sc' (master = local[*], app id = local-1535083313716).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0.2.6.3.0-235
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.io.File
import java.io.File

scala> import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

scala> val warehouseLocation = new File("spark-warehouse").getAbsolutePath
warehouseLocation: String = /usr/hdp/2.6.3.0-235/spark2/spark-warehouse

scala> val spark = SparkSession.builder().appName("app").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
18/08/24 00:02:25 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@37d3e740

scala> import spark.implicits._
import spark.implicits._

scala> import spark.sql
import spark.sql

scala> val df = Seq(("a", "b", "c"), ("a1", "b1", "c1")).toDF("A", "B", "C")
df: org.apache.spark.sql.DataFrame = [A: string, B: string ... 1 more field]

scala> val delim_char = "\u001F"
delim_char: String = ""

scala> df.coalesce(1).write.option("delimiter", delim_char).csv("file:///var/tmp/test")

scala>
Roy Miller

Thank you for your help.

The code above works when tested, and I could not find a way to reproduce how the problem was being generated here. However, the problem was that the variable was assigned a string that had been collected from a csv file: it contained the literal Unicode escape sequence "\u001F" (println showed the value as the text \u001F) rather than the actual character.
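
A minimal sketch of the situation (delimFromFile is a hypothetical stand-in for the value that was collected from the file):

 val delimFromFile = "\\u001F"   // six literal characters: \ u 0 0 1 F
 println(delimFromFile)          // prints \u001F
 println(delimFromFile.length)   // 6 -- too many characters for a CSV delimiter

 val compiled = "\u001F"         // compile-time escape: the actual control character
 println(compiled.length)        // 1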

Several approaches were tried. I finally found the solution in another Stack Overflow question about unescaping Unicode in strings...

1) Did not work -- delim_char.format("unicode-escape")

2) Worked --

// Replace literal \uXXXX escape sequences with the characters they encode
def unescapeUnicode(str: String): String =
  """\\u([0-9a-fA-F]{4})""".r.replaceAllIn(str,
    m => Integer.parseInt(m.group(1), 16).toChar.toString)

unescapeUnicode(delim_char)
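
A short usage sketch under the same assumption, with rawDelim standing in for the value read from the file:

 val rawDelim = "\\u001F"                   // literal escape sequence, length 6
 val delim = unescapeUnicode(rawDelim)      // the actual \u001F character, length 1
 df.coalesce(1).write.option("delimiter", delim).csv("file:///var/tmp/test")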
Terry