
I'm terribly new to Spark, Hive, big data, Scala, and all of this. I'm trying to write a simple function that takes an sqlContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might otherwise use as a delimiter.

I know that the spark-csv package I'm using has a delimiter option, but I don't know how to set it so that it reads \001 as one character and not something like an escaped 0, 0, and 1. Should I perhaps use hiveContext or something?

2 Answers


If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted). Use it like this:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\u0001")
    .load("cars.csv")
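A quick way to sanity-check the result (a minimal sketch, assuming only the df from the snippet above):

df.printSchema()             // show the inferred column names and types
df.show(5, truncate = false) // print the first few rows without truncating field contents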
Daniel Zolnai
  • Thank you! I didn't know about the \u0 thing. Could you explain a bit more exactly what it means and does? I'm guessing 'u' is for unicode, but I want to understand this thing properly. –  Mar 15 '16 at 10:12
  • Well, the \ character marks the beginning of an escape sequence, meaning that the following character is not part of the string but has a special meaning. The `u` character means that the following four digits are a Unicode code point, and 0001 is the code point of that special character. So all it does is insert that special character into the string (see the sketch after these comments). – Daniel Zolnai Mar 15 '16 at 10:31
  • Use '\x01' as the delimiter in case you are using PySpark. – ghosts Aug 10 '17 at 21:41
  • Did the above solution .option("delimiter", "\u0001") work for you? It's giving me the error below: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001 at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) – vinu.m.19 Apr 24 '19 at 20:12
  • If you are using Spark 2.x, then you are using the built-in csv parser, which does not support setting any character as the delimiter as of now. – Daniel Zolnai Apr 25 '19 at 05:19
  • This answer worked for me: https://stackoverflow.com/a/46349762/1316649 – fstang Jun 26 '19 at 02:40
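To illustrate the escape-sequence explanation in the comments above, here is a minimal plain-Scala sketch (nothing Spark-specific is assumed; the value name is only for illustration):

val delim = "\u0001"                  // one character: Unicode code point U+0001 (Ctrl-A / SOH)
println(delim.length)                 // 1: a single character, not the six-character text \u0001
println(delim.charAt(0).toInt)        // 1: its numeric code point
println(delim.charAt(0) == 1.toChar)  // true

The same character is written as '\x01' in Python source, which is what the PySpark comment above refers to.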

With Spark 2.x and the CSV API, use the sep option:

val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")
Mark Rajcok
  • 362,217
  • 114
  • 495
  • 492
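
For completeness, here is how this answer's reader might look as a self-contained Spark 2.x job. This is a minimal sketch: the SparkSession setup, application name, and S3 path are placeholders, not part of the original answer.

import org.apache.spark.sql.SparkSession

// Read a Ctrl-A-delimited file with the built-in CSV reader (Spark 2.x and later).
val spark = SparkSession.builder()
  .appName("ctrl-a-csv-example")             // hypothetical application name
  .getOrCreate()

val df = spark.read
  .option("header", "true")                  // treat the first line as column names
  .option("sep", "\u0001")                   // Ctrl-A (U+0001) as the single-character separator
  .csv("s3a://some-bucket/path/to/files/")   // hypothetical S3 path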