
I have a CSV document with German characters in a Data Lake environment, so I have to use Spark to read it.

For example, German_characters.csv:

"German_Text"
 Die 1949 gegründete Bundesrepublik Deutschland stellt die jüngste Ausprägung des 1871 erstmals begründeten.

Why is the UTF-8 encoding not working, while ISO-8859-1 does?

input_df = (spark.read
                .option("sep", sep)
                .option("header", "true")
                .option("encoding", "iso8859-1")
                .csv(path)
                )

Changing the encoding from ISO-8859-1 to UTF-8, or not setting the encoding at all:

.option("encoding", "UTF-8")

I get these results:

No encoding option:

Die 1949 gegr�ndete Bundesrepublik Deutschland stellt die j�ngste Auspr�gung des 1871 erstmals begr�ndeten  

UTF-8:

Die 1949 gegr�ndete Bundesrepublik Deutschland stellt die j�ngste Auspr�gung des 1871 erstmals begr�ndeten  

ISO-8859-1:

Die 1949 gegründete Bundesrepublik Deutschland stellt die jüngste Ausprägung des 1871 erstmals begründeten

I'm trying to find out why, but I can't figure it out.
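One thing worth checking is which encoding the file was actually written with, by looking at the raw bytes before Spark decodes them. This is only a sketch, assuming the file can be opened from the driver (e.g. through a mounted path or a local copy) and reusing the path variable from above:

# Peek at the raw bytes of the CSV to see how "ü" was actually stored.
# Assumes `path` is readable from the driver (e.g. a mounted Data Lake path).
with open(path, "rb") as f:
    raw = f.read(200)

print(raw)
# b'...gegr\xfcndete...'     -> the file was written as ISO-8859-1 / Windows-1252
# b'...gegr\xc3\xbcndete...' -> the file was written as UTF-8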

What is the difference between UTF-8 and ISO-8859-1?
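The byte-level difference can be shown outside Spark. Here is a small plain-Python sketch (not from the original post) that encodes ü both ways and then decodes the ISO-8859-1 bytes as UTF-8, which is exactly what produces the � above:

text = "gegründete"

utf8_bytes = text.encode("utf-8")         # "ü" -> two bytes: 0xC3 0xBC
latin1_bytes = text.encode("iso-8859-1")  # "ü" -> one byte:  0xFC

print(utf8_bytes)    # b'gegr\xc3\xbcndete'
print(latin1_bytes)  # b'gegr\xfcndete'

# A lone 0xFC is not a valid UTF-8 byte sequence, so a UTF-8 decoder either
# raises an error or substitutes U+FFFD (the replacement character shown as �):
print(latin1_bytes.decode("utf-8", errors="replace"))  # gegr�ndete

# Decoding with the encoding the bytes were actually written in works:
print(latin1_bytes.decode("iso-8859-1"))               # gegründete

If the file shows the single-byte 0xFC pattern, it was written as ISO-8859-1 (or the closely related Windows-1252), and the UTF-8 reader is not failing so much as refusing to interpret bytes that are not valid UTF-8.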

  • @DKNY Not really, because those only expose the problem. I already know why the diamond appears (in this case it's the German characters); my question is more focused on why one encoding works and the other doesn't. According to Stack Overflow, UTF-8 should work too, but it doesn't: "UTF-8 is a multibyte encoding that can represent any Unicode character. ISO 8859-1 is a single-byte encoding that can represent the first 256 Unicode characters. Both encode ASCII exactly the same way" – Enrique Benito Casado Apr 06 '22 at 15:19
  • Maybe the CSV's encoding is the issue. I checked with the string you provided and don't see any issue by reading it **without encoding** and **with encoding UTF-8**. Maybe try to create another file with some samples and try again? – pltc Apr 06 '22 at 18:28
  • Hi @pltc, what do you mean by "the CSV's encoding"? The text you see is what is in the CSV. – Enrique Benito Casado Apr 07 '22 at 06:57
  • Have you tried to create another CSV with the text above and test with your code? – pltc Apr 07 '22 at 18:16

1 Answer


In my case the reason was the encoding of the CSV file itself. I use Azure Data Factory to convert SQL tables to CSV files. I changed the encoding of the CSV dataset in ADF and used the same encoding (in my case ISO-8859-1) in Azure Synapse to read the CSV file with PySpark.
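For completeness, a sketch of what the read side looks like when it matches the ADF sink's encoding setting; sep and path are the same placeholders used in the question, not values from my pipeline:

input_df = (spark.read
                .option("sep", sep)                # same delimiter the ADF sink writes
                .option("header", "true")
                .option("encoding", "ISO-8859-1")  # must match the encoding configured on the ADF sink dataset
                .csv(path)
                )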

Simon