I have a couple of quite large CSV files (several GB) which use double quotes, i.e. they look something like this:
first field,"second, field","third ""field"""
For performance reasons I would like to transform them into Parquet files and then perform further analysis and transformation steps. For this I use the built-in PySpark functionality for reading CSV, i.e.
df = spark.read.csv(file_name, schema=schema, escape='"')
df.write.parquet(base_dir+"/parquet/"+name, partitionBy="year")
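In case it helps, here is a fuller, self-contained sketch of what I am currently running (the schema and the paths are just placeholders; my real schema has more columns, including the year column I partition by):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Placeholder schema -- the real files have more columns, including the
# "year" column used for partitioning below.
schema = StructType([
    StructField("first", StringType()),
    StructField("second", StringType()),
    StructField("third", StringType()),
    StructField("year", IntegerType()),
])

# Placeholder paths for illustration only.
file_name = "/data/input/big_file.csv"
base_dir = "/data/output"
name = "big_file"

# quote='"' is already the default; escape='"' makes the parser treat a
# doubled quote inside a quoted field as a literal quote character.
df = spark.read.csv(file_name, schema=schema, quote='"', escape='"')

# Write as Parquet, partitioned by year.
df.write.parquet(base_dir + "/parquet/" + name, partitionBy="year")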
I could not find any dedicated option for double quotes when reading CSV with Spark, so, as you can see, I used " as the escape character.
So far it seems to work, since there are no newlines in the text files (as far as I know, embedded newlines are not supported by the Spark CSV reader), but I have a hunch that this might not be the correct way to deal with it. Any thoughts or recommendations?
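For reference, this is the kind of sanity check I did to convince myself the doubled quotes are unescaped correctly (the temp path is just for this illustration, and it assumes the same spark session as above):

import os
import tempfile

# Write the sample row from above to a small temp file and read it back
# with the same quote/escape options.
sample = 'first field,"second, field","third ""field"""\n'
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w") as f:
    f.write(sample)

parsed = spark.read.csv(sample_path, quote='"', escape='"')
print(parsed.first())
# Expected: Row(_c0='first field', _c1='second, field', _c2='third "field"')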
As the files are quite large, performance is also a concern, so falling back to RDDs and map seems like it would impose too high a performance cost.