
I have a "CSV" file that uses a multi-character delimiter, so the data looks something like

field1^|^,^|^field2^|^,^|^field3^|^,^|^field4

The following code in a Databricks notebook throws the error shown below on the second line, where it tries to write the DataFrame df to a destination table:

df = spark.read.csv(".../Test/MyFile.csv", sep="^|^,^|^", header="true", inferSchema="true")
df.write.(.....)

Error:

java.sql.SQLException: Spark Dataframe and SQL Server table have differing numbers of columns

Both the CSV file and the database table have exactly the same column names and the same number of columns. The error seems to indicate that the delimiter ^|^,^|^ is not being recognized. Is there a way to parse this data file with the ^|^,^|^ delimiter?

UPDATE: My bad, I forgot to include the sep argument in the spark.read.csv(...) call above. I have added it now with the value ^|^,^|^, but the error is still the same.


1 Answer


You can try two things on Databricks:

  1. Give Spark the option to read with this delimiter, e.g. spark.read.option("delimiter", "^|^,^|^").csv(".../Test/MyFile.csv", header="true", inferSchema="true") - Reference
  2. Have a Python or shell script handle the delimiter before the file gets loaded by Spark, as follows (a sketch for reading the cleaned file back follows the snippet). Reference
with open(".../Test/MyFile.csv") as f:
    newText = f.read().replace('^|^,^|^', '|')  # collapse the multi-character delimiter to a single "|"
with open(".../Test/MyFile_clean.csv", "w") as f:  # output file name is just an example
    f.write(newText)

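If you go the pre-processing route, a minimal follow-up sketch for reading the cleaned copy back into Spark (assuming it was written to the hypothetical MyFile_clean.csv above with a single "|" delimiter) could look like this:

# read the single-character-delimited copy back in (path and file name are assumptions)
df = spark.read.csv(".../Test/MyFile_clean.csv", sep="|", header="true", inferSchema="true")
df.show()  # sanity-check that all four columns were parsed
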
Also, you can try df.show() to make sure your dataframe columns are read properly.

  • I already have that `sep` attribute used in the call to `spark.read.csv(...)`. It was my fault, I had forgotten to add it to the code. I just added it now and added an **UPDATE** section. But the error is still there. It's a `UTF-16 LE BOM` file, I wonder if it has anything to do with the error. – nam May 12 '22 at 16:17
  • You may try adding an option `.option("encoding", "UTF-16")`. If you are lucky, it will work! It seems this is an [unresolved SPARK issue](https://issues.apache.org/jira/browse/SPARK-32965). – Phuri Chalermkiatsakul May 13 '22 at 02:07
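
Putting the suggestions from the comments together, an untested sketch that combines the multi-character delimiter with the UTF-16 encoding option might look like this (whether a multi-character delimiter is accepted can depend on the Spark version):

# combine the delimiter and encoding options suggested above (untested sketch)
df = (spark.read
      .option("delimiter", "^|^,^|^")
      .option("encoding", "UTF-16")
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(".../Test/MyFile.csv"))
df.show()  # verify the columns before writing to SQL Server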