
I'm trying to read CSV files with the ^A (\001) delimiter in PySpark. I went through the link below, tried the same approach, and it works as expected, i.e. I was able to read the CSV files and process them further.

Link: How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?

Working

spark.read.option("wholeFile", "true"). \
                    option("inferSchema", "false"). \
                    option("header", "true"). \
                    option("quote", "\""). \
                    option("multiLine", "true"). \
                    option("delimiter", "\u0001"). \
                    csv("path/to/csv/file.csv")

Instead of hard-coding the delimiter, I want to read it from a database. Below is the approach I tried.

update table set field_delimiter= 'field_delimiter=\\u0001'

(This is stored as a key-value pair; using the key, I access the value.)

delimiter = config.FIELD_DELIMITER (This fetches the delimiter from the database)
>>> print(delimiter)
\u0001
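
The printed output can hide the difference between the six-character escape text `\u0001` and the single control character it denotes, which is likely what the error below is about. A quick sanity check (a sketch; the literal stands in for the value fetched via `config.FIELD_DELIMITER`):

```python
import codecs

# What the database may actually return: the six-character text
# backslash, u, 0, 0, 0, 1 -- not the single SOH control character.
raw = "\\u0001"          # stand-in for config.FIELD_DELIMITER
print(len(raw))          # 6 -> it is the escape sequence, not the character
print(len("\u0001"))     # 1 -> the actual SOH control character

# Decoding the escape sequence yields the real character:
decoded = codecs.decode(raw, "unicode_escape")
print(len(decoded))         # 1
print(decoded == "\u0001")  # True
```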

Not Working

spark.read.option("wholeFile", "true"). \
                    option("inferSchema", "false"). \
                    option("header", "true"). \
                    option("quote", "\""). \
                    option("multiLine", "true"). \
                    option("delimiter", delimiter). \
                    csv("path/to/csv/file.csv")

Error:

: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001
    at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:178)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:178)
    at scala.Option.orElse(Option.scala:289)
    at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:177)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
  • Aren't unicode values stored like `u'\u0001'` in Python? Isn't that how you should store it, since you're storing it in a variable? – philantrovert Apr 19 '18 at 17:30
    @philantrovert, could you please elaborate on storing the field in Python? `delimiter = config.FIELD_DELIMITER` is how I'm storing it as of now. – data_addict Apr 19 '18 at 17:45
  • In your first example, the delimiter is a String. Maybe the `delimiter` read from the database is returned as a character? – Dan W Apr 19 '18 at 18:08
  • @DanW, I get the type of the `delimiter`, it's – data_addict Apr 19 '18 at 23:42
  • I am working on something similar as well. You need to specifically specify to read this delimiter as Unicode. – kruparulz14 Feb 20 '20 at 00:27
  • @data_addict: Did you resolve this issue? I'm in a similar situation. I'm trying to escape c-cedilla (\u0039) and reading it from the database, but I end up with similar issues. Despite adding `.option("encoding","UTF-8")` I see the same error. – underwood Mar 27 '20 at 19:35
  • @underwood, I didn't get the solution – data_addict Mar 30 '20 at 12:34
  • @data_addict Did you resolve this issue? I am facing the same issue. Any leads will be helpful. – Amrutha K Sep 23 '20 at 10:28

1 Answer


I'm working on a file that has the same delimiter, i.e. "\u0001".

To make it work in Python 3.x, I imported:

from __future__ import unicode_literals

and read my file into a DataFrame:

df = spark.read.format("csv").option("inferSchema", True)\
     .option("delimiter",u"\u0001").load(r"/application/file.csv")

Output

+--------------+------------------------------------+------+---------------------+---------------------+-----------+------------------+-------------------+-----------------+
|ts            |id                                  |source|FaM                  |record_Num           |primlim_no |first_name        |middle_name        |last_name        |
+--------------+------------------------------------+------+---------------------+---------------------+-----------+------------------+-------------------+-----------------+
|20150728133902|3d942d41-edde-419c-a15b             |AS4   |AGC                  |300104               |76000389072|lalal             |H                  |RAMEN            |
|20150728133902|5277f150-6890-4c99-b85a             |AS4   |AGC                  |3001261              |76000027136|roberta           |null               |BIRDY            |
|20150728133902|10c8f16b-cc2f-42b4-810d             |AS4   |AGC                  |400005920            |76000328013|bobby             |L                  |LORDS            |
|20150728133902|5c1a8c4c-a590-4b3b-95f5             |AS4   |AGC                  |3154018172           |76000054981|jackie            |A                  |DOWN             |
|20150728133902|a510763b-57da-4767-972d             |AS4   |AGC                  |3059318259           |76000350660|rob               |W                  |THORN            |
+--------------+------------------------------------+------+---------------------+---------------------+-----------+------------------+-------------------+-----------------+
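
If the value fetched from the database turns out to be the literal escape text `\u0001` rather than the character itself, decoding it before passing it to `option("delimiter", ...)` is one way to reconcile the working and non-working examples in the question. A sketch (the literal stands in for `config.FIELD_DELIMITER`, and the `spark.read` call is shown as a comment since it needs a live session):

```python
import codecs

# Stand-in for the value fetched from the database: the six-character
# text \u0001, not the single control character.
raw = "\\u0001"

# Convert the escape sequence into the real \x01 (SOH) delimiter character.
delimiter = codecs.decode(raw, "unicode_escape")
assert len(delimiter) == 1

# The decoded value can then be used exactly like the hard-coded literal:
# df = spark.read.option("wholeFile", "true") \
#         .option("header", "true") \
#         .option("multiLine", "true") \
#         .option("delimiter", delimiter) \
#         .csv("path/to/csv/file.csv")
```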