How to read csv file with additional comma in quotes using pyspark?

Question

I am having some troubles reading the following CSV data in UTF-16:

FullName, FullLabel, Type
TEST.slice, "Consideration":"Verde (Spar Verde, Fonte Verde)", Test,

As far as I understand, it should not be a problem for a reader, since there is a quote parameter to handle that.

df = spark.read.csv(file_path, header=True, encoding='UTF-16', quote = '"')

However, this would still give me an incorrect split:

Is there some way to handle those cases or do I need to work it around with RDD?

Thank you in advance.

try quote like this : quote = '\"' . That is using escape character. Let us know if that works. — vijayinani, Feb 07 '21 at 11:47
the quotes are supposed to enclose the whole column, not parts of the column — mck, Feb 07 '21 at 11:51

blackbishop · Accepted Answer · 2021-02-07T13:36:17.360

You can read as text using spark.read.text and split the values using some regex to split by comma but ignore the quotes (you can see this post), then get the corresponding columns from the resulting array:

from pyspark.sql import functions as F

df = spark.read.text(file_path)

df = df.filter("value != 'FullName, FullLabel, Type'") \
    .withColumn(
    "value",
    F.split(F.col("value"), ',(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)')
).select(
    F.col("value")[0].alias("FullName"),
    F.col("value")[1].alias("FullLabel"),
    F.col("value")[2].alias("Type")
)

df.show(truncate=False)

#+----------+--------------------------------------------------+-----+
#|FullName  |FullLabel                                         |Type |
#+----------+--------------------------------------------------+-----+
#|TEST.slice| "Consideration":"Verde (Spar Verde, Fonte Verde)"| Test|
#+----------+--------------------------------------------------+-----+

Update:

For input file in utf-16, you can replace spark.read.text by loading the file as binaryFiles and then convert the resulting rdd into dataframe :

df = sc.binaryFiles(file_path) \
    .flatMap(lambda x: [[l] for l in x[1].decode("utf-16").split("\n")]) \
    .toDF(["value"])

score 1 · Answer 2 · answered Feb 07 '21 at 13:48

Just another option as below (if you find it simple):

First read the text file as RDD and replace the ":" with ~:~ and save the text file.

sc.textFile(file_path).map(lambda x: x.replace('":"','~:~')).saveAsTextFile(tempPath)

Next, read the temp path and replace ~:~ with ":"again, but this time as a DF.

from pyspark.sql import functions as F
spark.read.option('header','true').csv(tempPath).withColumn('FullLabel',F.regexp_replace(F.col('FullLabel'),'~:~','":"')).show(1, False)

+----------+-----------------------------------------------+----+
|FullName  |FullLabel                                      |Type|
+----------+-----------------------------------------------+----+
|TEST.slice|Consideration":"Verde (Spar Verde, Fonte Verde)|Test|
+----------+-----------------------------------------------+----+

How to read csv file with additional comma in quotes using pyspark?

2 Answers2

Linked