I am working with Azure Databricks and PySpark 2.4.3, trying to build a robust approach to importing files from Blob storage into the cluster. Things mostly work, but parsing is not raising errors as I expect.
I have a 7 GB CSV file that I know has a number of records with issues that cause rows to be skipped (found by reconciling the record count of the output parquet file written from the DataFrame against the source CSV). I am attempting to use the badRecordsPath option, but no output is being generated there (that I can find). Can anyone share advice on how to troubleshoot file loading when there is bad data, and how to build a robust process that surfaces parsing errors rather than handling them permissively in the future?
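(The reconciliation itself is just a row count of the written parquet compared against the source CSV's record count, roughly like the following; the output path is a placeholder:)

# Placeholder path, shown only to illustrate the count check.
parquet_count = spark.read.parquet("/mnt/output/imported_table").count()
print(parquet_count)  # comes up roughly half a million short of the source record count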
One issue already tackled is embedded newlines, where I've found the wholeFile and multiLine options have helped, but I am now struggling to get insight into which records are not being accepted.
The Python code that I am using to load the file looks like this:
# Load the source CSV from Blob storage, redirecting unparseable records to badRecordsPath.
myDf = spark.read.format("csv")\
    .option("inferSchema", "true")\
    .option("header", "true")\
    .option("wholeFile", "true")\
    .option("multiLine", "true")\
    .option("ignoreLeadingWhiteSpace", "true")\
    .option("ignoreTrailingWhiteSpace", "true")\
    .option("parserLib", "univocity")\
    .option("quote", '"')\
    .option("escape", '"')\
    .option("badRecordsPath", "/tmp/badRecordsPath")\
    .load(BLOB_FILE_LOCATION)
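(I have been checking for badRecordsPath output after the load along these lines, using the standard Databricks notebook utilities; so far nothing appears:)

# Bad-record output, when produced, lands under timestamped subdirectories of the
# configured path; this call throws if the path does not exist at all.
display(dbutils.fs.ls("/tmp/badRecordsPath"))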
What I see is that about half a million records out of more than 10 million are being dropped. I am currently unable to easily tell which records are affected, or even that failures are occurring, or what they are (short of exporting and comparing data, which would be OK for a one-time load but is not acceptable for a production system). I've also tried the other read modes without luck: it always behaves as if DROPMALFORMED were set, which is not the case, even when setting mode to "FAILFAST" in an experiment.
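(The FAILFAST experiment was essentially the same load with the mode option added, simplified here; I expected it to raise on the first malformed record, but it still completes with the reduced row count:)

# Simplified version of the FAILFAST experiment; the variable name is illustrative.
failFastDf = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .option("multiLine", "true")\
    .option("mode", "FAILFAST")\
    .load(BLOB_FILE_LOCATION)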
Many thanks for any insight / advice.