
I use Spark 2.1.

The input CSV file contains Unicode characters, as shown below:

(screenshot: unicode-input-csv)

After parsing this CSV file, the output looks like the following:

(screenshot: unicode-output-csv)

I use MS Excel 2010 to view files.

The Java code used is

@Test
public void TestCSV() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";
    getSparkSession()
      .read()
      .option("inferSchema", "true")
      .option("header", "true")
      .option("encoding", "UTF-8")
      .csv(inputPath)
      .write()
      .option("header", "true")
      .option("encoding", "UTF-8")
      .mode(SaveMode.Overwrite)
      .csv(outputPath);
}

How can I get the output to be the same as the input?

Jacek Laskowski
Jyoti Ranjan
  • Thanks @Jacek, I checked the file encoding with the file command and found that the file is actually ISO-8859-1, so I parsed it accordingly and got the desired result. – Jyoti Ranjan May 22 '17 at 05:56
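
For reference, a minimal sketch of the corrected test based on that finding (it reuses the getSparkSession() helper and paths from the question; the only functional change is the read-side encoding):

import java.io.IOException;
import org.apache.spark.sql.SaveMode;
import org.junit.Test;

@Test
public void testCsvIso88591() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";
    getSparkSession()
      .read()
      .option("inferSchema", "true")
      .option("header", "true")
      // the source file is ISO-8859-1, not UTF-8
      .option("encoding", "ISO-8859-1")
      .csv(inputPath)
      .write()
      .option("header", "true")
      .option("encoding", "UTF-8")
      .mode(SaveMode.Overwrite)
      .csv(outputPath);
}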

4 Answers


I was able to read ISO-8859-1 data with Spark, but when I store the same data back to S3/HDFS and read it again, it has been converted to UTF-8.

For example, é becomes é.

val df = spark.read
  .format("csv")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .load("s3://bucket/folder")
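
In other words, the data survives the round trip; it is simply stored as UTF-8 afterwards. A sketch of verifying that with the Java API (the output path and session setup are assumptions, not part of the answer): read the source as ISO-8859-1, write it out, then read the written copy back as UTF-8, and the accented characters should display correctly again. Spark 2.x appears to always write CSV as UTF-8, which matches what this answer observes; newer releases also document an encoding option for the write side, but verify that against your version.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class EncodingRoundTrip {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("encoding-round-trip").getOrCreate();

        // decode the source file as ISO-8859-1
        Dataset<Row> original = spark.read()
            .option("header", "true")
            .option("encoding", "ISO-8859-1")
            .csv("s3://bucket/folder");

        // write it back; the bytes on disk are now UTF-8
        original.write()
            .option("header", "true")
            .mode(SaveMode.Overwrite)
            .csv("s3://bucket/folder-utf8");   // hypothetical output location

        // read the copy back as UTF-8: é should appear as é, not é
        spark.read()
            .option("header", "true")
            .option("encoding", "UTF-8")
            .csv("s3://bucket/folder-utf8")
            .show(5, false);

        spark.stop();
    }
}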
Stanislav Mekhonoshin
Saida
  • Can someone help me save the ISO-8859-1 data to AWS S3/HDFS? – Saida Oct 24 '17 at 15:51
  • An answer is not the correct place to ask a question. You should create a new question. – mch Oct 24 '17 at 16:09
  • In case someone here is trying to read an Excel CSV file into Spark: Excel has an option to save the CSV with UTF-8 encoding. If you use that option, you don't need to specify the encoding as ISO-8859-1. – Omkar Neogi Jul 01 '19 at 16:05

My guess is that the input file is not in UTF-8 and hence you get the incorrect characters.

My recommendation would be to write a pure Java application (with no Spark at all) and see if reading and writing gives the same results with UTF-8 encoding.
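
A minimal sketch of such a standalone check, using only the JDK (the local file names are hypothetical stand-ins for the files from the question). If the input is not valid UTF-8, the read below fails with a MalformedInputException, which by itself tells you the file is in some other encoding:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class PlainJavaEncodingCheck {
    public static void main(String[] args) throws IOException {
        // hypothetical local copies of the input and output files
        Path in = Paths.get("unicode.csv");
        Path out = Paths.get("unicode-copy.csv");

        // decode the file as UTF-8; invalid byte sequences throw MalformedInputException
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        lines.forEach(System.out::println);

        // write the lines back as UTF-8 and compare the two files
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}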

Jacek Laskowski

.option('encoding', 'ISO-8859-1') worked for me. Acute, caret, and cedilla accents, among others, appeared correctly.


I had the same problem and solved it by setting the encoding to "UTF-8":

  input_df = (spark.read
              .option("sep", sep)
              .option("header", "true")
              .option("encoding", "UTF-8")
              .csv(my_path)
              )
Enrique Benito Casado