I'm trying to read CSV files that use ^A (\001) as the delimiter in PySpark. I went through the link below, tried the same approach, and it works as expected, i.e. I am able to read the CSV files and process them further.
Link: How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?
Working
spark.read.option("wholeFile", "true"). \
option("inferSchema", "false"). \
option("header", "true"). \
option("quote", "\""). \
option("multiLine", "true"). \
option("delimiter", "\u0001"). \
csv("path/to/csv/file.csv")
Instead of hard-coding the delimiter, I want to read it from the database. Below is the approach I tried.
update table set field_delimiter= 'field_delimiter=\\u0001'
(It is stored as a key-value pair; using the key, I access the value.)
delimiter = config.FIELD_DELIMITER  # fetches the delimiter from the database
>>> print(delimiter)
\u0001
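Note that this print output is ambiguous: the actual U+0001 control character is invisible on most terminals, so the fact that \u0001 is printed suggests delimiter is the literal six-character escape sequence rather than the single character Spark expects. A quick check that tells the two apart (a sketch, assuming Python 3):

print(len(delimiter), repr(delimiter))  # 6 '\\u0001'  -> literal escape sequence
print(len("\u0001"), repr("\u0001"))    # 1 '\x01'     -> actual control character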
Not Working
spark.read.option("wholeFile", "true"). \
option("inferSchema", "false"). \
option("header", "true"). \
option("quote", "\""). \
option("multiLine", "true"). \
option("delimiter", delimiter). \
csv("path/to/csv/file.csv")
Error:
: java.lang.IllegalArgumentException: Unsupported special character for delimiter: \u0001
at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:106)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:178)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:178)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:177)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
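My guess is that the value read back from the database is the literal backslash escape, and that decoding it before passing it to Spark would avoid the error. A minimal sketch of that workaround (assuming Python 3, and that config.FIELD_DELIMITER returns exactly the six-character string \u0001):

import codecs

raw = config.FIELD_DELIMITER                      # literal string "\u0001" from the database
delimiter = codecs.decode(raw, "unicode_escape")  # now the single U+0001 character

df = spark.read.option("header", "true") \
    .option("multiLine", "true") \
    .option("quote", "\"") \
    .option("delimiter", delimiter) \
    .csv("path/to/csv/file.csv")

Is this the right way to handle it, or is there a cleaner approach?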