I have an application that loads CSV files (UTF-8 encoded, i.e. the default CSV encoding) into PySpark dataframes. It had been doing this for about a year without any trouble, but all of a sudden it is reading the BOM in as part of the file (the character shows up as ï»¿).

Switching the encoding to UTF-16 or cp1252 does not seem to work, and it appears PySpark does not support UTF-8-sig encoding.

Has anyone encountered this problem recently? It seems as though Excel may have updated something recently that is causing this.

The code used to read the CSV is:

```python
self.data = self.spark.read.csv(path=self.input_file, header=True, schema=self.schema)
```
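For reference, one common workaround (a sketch, not part of the original post) is to strip a leading BOM (U+FEFF) from the column names after reading, since with `header=True` the BOM typically leaks into the first header. The `strip_bom` helper below is hypothetical; `columns` stands in for what `df.columns` would return:

```python
# Sketch: remove a UTF-8 BOM that leaks into the first column name
# when Spark reads a CSV file written with a BOM.

def strip_bom(name: str) -> str:
    """Strip a leading BOM character (U+FEFF) from a column name."""
    return name.lstrip("\ufeff")

# Hypothetical header row as Spark would see it: the BOM bytes are
# decoded into the first column name.
columns = ["\ufeffid", "name", "value"]
clean = [strip_bom(c) for c in columns]
# clean == ["id", "name", "value"]
```

On an actual dataframe this could be applied via `df.toDF(*[strip_bom(c) for c in df.columns])`, which renames all columns in one call.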
  • What does the *code* look like? Did previous files contain a BOM at all? A UTF8 file without a BOM is *indistinguishable* from a US-ASCII file unless it contains non-US characters – Panagiotis Kanavos May 21 '19 at 14:43
  • `It seems as though Excel may have updated` what does Excel have to do with the question? If the CSV is created by a *human* using Excel, you should ask whoever does this job what options he/she uses when exporting the Excel sheet to a CSV. It could be that the user who does the job changed and the new one uses different settings. You can avoid that problem, though, if you read the xlsx file directly – Panagiotis Kanavos May 21 '19 at 14:45
  • In any case, you can change your code to handle the BOM [as shown in this possibly duplicate question](https://stackoverflow.com/questions/40310042/python-read-csv-bom-embedded-into-the-first-key) – Panagiotis Kanavos May 21 '19 at 14:47
  • I assume that the previous files contained BOMs but don't know, as we did not have this issue. I've asked the users and they did not change any part of the process, and I have also replicated their process and confirmed the error. I'm using PySpark to read this in, not Python, so that solution will not work for me. The actual code used to read in the file is: `self.data = self.spark.read.json(self.input_file, multiLine=True, schema=self.schema)` – user3711502 May 21 '19 at 15:02
  • If you just started having problems with BOMs, it means older files *didn't* have BOMs. As for replicating the process, what process, what steps, what settings? What *JSON*? The question asks about *CSV*, not JSON – Panagiotis Kanavos May 21 '19 at 15:03
  • My mistake, I copied the wrong code; the file is read from an EC2 instance into a PySpark dataframe. `self.data = self.spark.read.csv(path=self.input_file, header=True, schema=self.schema)` – user3711502 May 21 '19 at 15:18
