I have an application that loads CSV files (UTF-8 encoded, i.e. the default CSV encoding) into PySpark dataframes. It's been doing this for about a year without any trouble, but all of a sudden it is reading the BOM in as part of the file (the character is \ufeff, the UTF-8 byte-order mark).
Switching the encoding to UTF-16 or cp1252 does not help, and it appears PySpark's CSV reader does not support the utf-8-sig encoding.
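For what it's worth, plain Python handles this: the utf-8-sig codec strips a leading BOM on decode and otherwise behaves like utf-8. One workaround I'm considering (just a sketch, not something the app does today) is rewriting the file before Spark reads it:

import codecs

def strip_bom(path):
    # utf-8-sig drops a leading BOM if one is present; otherwise it decodes as plain utf-8
    with codecs.open(path, "r", encoding="utf-8-sig") as f:
        content = f.read()
    # write the file back without the BOM so Spark sees clean UTF-8
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(content)

The downside is that this reads the whole file into memory, so it's only practical for files that fit in RAM, and it adds a preprocessing step I'd rather avoid.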
Has anyone else encountered this problem recently? It seems as though Excel may have updated something that started writing the BOM into these files.
The code used to read the CSV is:

self.data = self.spark.read.csv(path=self.input_file, header=True, schema=self.schema)
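If pre-processing the file isn't an option, the other fallback I can think of is cleaning the character up after the read. This is only a sketch, assuming the \ufeff ends up either in the first column name or at the start of the first column's values:

from pyspark.sql import functions as F

# strip a leading U+FEFF from the column names (relevant when the schema is inferred from the header)
self.data = self.data.toDF(*[c.lstrip("\ufeff") for c in self.data.columns])

# strip it from the first column's values (relevant when it lands in the data instead)
first_col = self.data.columns[0]
self.data = self.data.withColumn(first_col, F.regexp_replace(F.col(first_col), "^\ufeff", ""))

This feels like a hack, though; I'd much rather have Spark handle the BOM at read time.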