I have a couple of quite large CSV files (several GB) which use double quotes, i.e. they look something like this:
first field,"second, field","third ""field"""
For performance reasons I would like to transform them into Parquet files and then perform further analysis and transformation steps. For this I use the built-in PySpark functionality for reading CSV, i.e.
df = spark.read.csv(file_name, schema=schema, escape='"')
df.write.parquet(base_dir+"/parquet/"+name, partitionBy="year")
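In case it helps, here is a fuller, self-contained sketch of what I am currently running (the schema and the paths are just placeholders; my real schema has more columns, including the year column I partition by):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Placeholder schema -- the real files have more columns, including the
# "year" column used for partitioning below.
schema = StructType([
    StructField("first", StringType()),
    StructField("second", StringType()),
    StructField("third", StringType()),
    StructField("year", IntegerType()),
])

# Placeholder paths for illustration only.
file_name = "/data/input/big_file.csv"
base_dir = "/data/output"
name = "big_file"

# quote='"' is already the default; escape='"' makes the parser treat a
# doubled quote inside a quoted field as a literal quote character.
df = spark.read.csv(file_name, schema=schema, quote='"', escape='"')

# Write as Parquet, partitioned by year.
df.write.parquet(base_dir + "/parquet/" + name, partitionBy="year")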
I could not find any dedicated option for double quotes when reading CSV with Spark, so, as you can see, I used " as the escape character.
So far it seems to work, since there are no newlines in the text files (as far as I know, embedded newlines are not supported by the Spark CSV reader), but I have a hunch that this might not be the correct way to deal with it. Any thoughts or recommendations?
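For reference, this is the kind of sanity check I did to convince myself the doubled quotes are unescaped correctly (the temp path is just for this illustration, and it assumes the same spark session as above):

import os
import tempfile

# Write the sample row from above to a small temp file and read it back
# with the same quote/escape options.
sample = 'first field,"second, field","third ""field"""\n'
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w") as f:
    f.write(sample)

parsed = spark.read.csv(sample_path, quote='"', escape='"')
print(parsed.first())
# Expected: Row(_c0='first field', _c1='second, field', _c2='third "field"')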
As the files are quite large, performance is also a concern, so falling back to RDDs and map seems like it would impose too high a performance cost.