
I'm terribly new to Spark, Hive, big data, Scala, and all of this. I'm trying to write a simple function that takes an sqlContext, loads a CSV file from S3, and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might otherwise use as a delimiter.

I know that the spark-csv package I'm using has a delimiter option, but I don't know how to set it so that it reads \001 as one character and not something like an escaped 0, 0, and 1. Should I perhaps use hiveContext or something?

2 Answers


If you check the GitHub page, there is a delimiter parameter for spark-csv (as you also noted). Use it like this:

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\u0001")
    .load("cars.csv")
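A quick way to sanity-check the result (a minimal sketch, assuming only the df from the snippet above):

df.printSchema()             // show the inferred column names and types
df.show(5, truncate = false) // print the first few rows without truncating field contents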
Daniel Zolnai
  • Thank you! I didn't know about the \u0 thing. Could you explain a bit more exactly what it means and does? I'm guessing 'u' is for unicode, but I want to understand this thing properly. –  Mar 15 '16 at 10:12
  • Well, the \ character marks the beginning of an escape sequence, meaning that the following character is not part of the string but has a special meaning. The `u` character means that the following four digits are a Unicode code point, and 0001 is the code point of that special character. So all it does is insert that special character into the string (see the sketch after these comments). – Daniel Zolnai Mar 15 '16 at 10:31
  • Use '\x01' as the delimiter in case you are using PySpark. – ghosts Aug 10 '17 at 21:41
  • Did the above solution .option("delimiter", "\u0001") work for you? It's giving me the error below: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001 at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:83) at org.apache.spark.sql.execution.datasources.csv.CSVOptions.(CSVOptions.scala:39) – vinu.m.19 Apr 24 '19 at 20:12
  • If you are using Spark 2.x, then you are using the built-in csv parser, which does not support setting any character as the delimiter as of now. – Daniel Zolnai Apr 25 '19 at 05:19
  • This answer worked for me: https://stackoverflow.com/a/46349762/1316649 – fstang Jun 26 '19 at 02:40
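To illustrate the escape-sequence explanation in the comments above, here is a minimal plain-Scala sketch (nothing Spark-specific is assumed; the value name is only for illustration):

val delim = "\u0001"                  // one character: Unicode code point U+0001 (Ctrl-A / SOH)
println(delim.length)                 // 1: a single character, not the six-character text \u0001
println(delim.charAt(0).toInt)        // 1: its numeric code point
println(delim.charAt(0) == 1.toChar)  // true

The same character is written as '\x01' in Python source, which is what the PySpark comment above refers to.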

With Spark 2.x and the CSV API, use the sep option:

val df = spark.read
  .option("sep", "\u0001")
  .csv("path_to_csv_files")
Mark Rajcok
  • 362,217
  • 114
  • 495
  • 492
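
For completeness, here is how this answer's reader might look as a self-contained Spark 2.x job. This is a minimal sketch: the SparkSession setup, application name, and S3 path are placeholders, not part of the original answer.

import org.apache.spark.sql.SparkSession

// Read a Ctrl-A-delimited file with the built-in CSV reader (Spark 2.x and later).
val spark = SparkSession.builder()
  .appName("ctrl-a-csv-example")             // hypothetical application name
  .getOrCreate()

val df = spark.read
  .option("header", "true")                  // treat the first line as column names
  .option("sep", "\u0001")                   // Ctrl-A (U+0001) as the single-character separator
  .csv("s3a://some-bucket/path/to/files/")   // hypothetical S3 path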