
I use Spark 2.1.

The input CSV file contains Unicode characters, as shown below:

(screenshot: unicode-input-csv)

After parsing this CSV file, the output looks like the following:

(screenshot: unicode-output-csv)

I use MS Excel 2010 to view files.

The Java code used is

@Test
public void TestCSV() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";
    getSparkSession()
      .read()
      .option("inferSchema", "true")
      .option("header", "true")
      .option("encoding", "UTF-8")
      .csv(inputPath)
      .write()
      .option("header", "true")
      .option("encoding", "UTF-8")
      .mode(SaveMode.Overwrite)
      .csv(outputPath);
}

How can I get the output to be the same as the input?

Jacek Laskowski
Jyoti Ranjan
  • Thanks @Jacek, I checked the file encoding with the file command and found that the file is actually ISO-8859-1, so I parsed it accordingly and got the desired result. – Jyoti Ranjan May 22 '17 at 05:56
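
For reference, a minimal sketch of the corrected test based on that finding (it reuses the getSparkSession() helper and paths from the question; the only functional change is the read-side encoding):

import java.io.IOException;
import org.apache.spark.sql.SaveMode;
import org.junit.Test;

@Test
public void testCsvIso88591() throws IOException {
    String inputPath = "/user/jpattnaik/1945/unicode.csv";
    String outputPath = "file:\\C:\\Users\\jpattnaik\\ubuntu-bkp\\backup\\bug-fixing\\1945\\output-csv";
    getSparkSession()
      .read()
      .option("inferSchema", "true")
      .option("header", "true")
      // the source file is ISO-8859-1, not UTF-8
      .option("encoding", "ISO-8859-1")
      .csv(inputPath)
      .write()
      .option("header", "true")
      .option("encoding", "UTF-8")
      .mode(SaveMode.Overwrite)
      .csv(outputPath);
}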

4 Answers


I was able to read ISO-8859-1 data with Spark, but when I store the same data back to S3/HDFS and read it again, it has been converted to UTF-8.

For example, é becomes é.

val df = spark.read
  .format("csv")
  .option("delimiter", ",")
  .option("quote", "\"")
  .option("escape", "\"")
  .option("header", true)
  .option("encoding", "ISO-8859-1")
  .load("s3://bucket/folder")
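
In other words, the data survives the round trip; it is simply stored as UTF-8 afterwards. A sketch of verifying that with the Java API (the output path and session setup are assumptions, not part of the answer): read the source as ISO-8859-1, write it out, then read the written copy back as UTF-8, and the accented characters should display correctly again. Spark 2.x appears to always write CSV as UTF-8, which matches what this answer observes; newer releases also document an encoding option for the write side, but verify that against your version.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class EncodingRoundTrip {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("encoding-round-trip").getOrCreate();

        // decode the source file as ISO-8859-1
        Dataset<Row> original = spark.read()
            .option("header", "true")
            .option("encoding", "ISO-8859-1")
            .csv("s3://bucket/folder");

        // write it back; the bytes on disk are now UTF-8
        original.write()
            .option("header", "true")
            .mode(SaveMode.Overwrite)
            .csv("s3://bucket/folder-utf8");   // hypothetical output location

        // read the copy back as UTF-8: é should appear as é, not é
        spark.read()
            .option("header", "true")
            .option("encoding", "UTF-8")
            .csv("s3://bucket/folder-utf8")
            .show(5, false);

        spark.stop();
    }
}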
Stanislav Mekhonoshin
Saida
  • Can someone help me save the ISO-8859-1 data to AWS S3/HDFS? – Saida Oct 24 '17 at 15:51
  • An answer is not the correct place to ask a question. You should create a new question. – mch Oct 24 '17 at 16:09
  • In case someone here is trying to read an Excel CSV file into Spark: Excel has an option to save the CSV with UTF-8 encoding. If you use that option, you don't need to specify the encoding as ISO-8859-1. – Omkar Neogi Jul 01 '19 at 16:05

My guess is that the input file is not in UTF-8 and hence you get the incorrect characters.

My recommendation would be to write a pure Java application (with no Spark at all) and see if reading and writing gives the same results with UTF-8 encoding.
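
A minimal sketch of such a standalone check, using only the JDK (the local file names are hypothetical stand-ins for the files from the question). If the input is not valid UTF-8, the read below fails with a MalformedInputException, which by itself tells you the file is in some other encoding:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class PlainJavaEncodingCheck {
    public static void main(String[] args) throws IOException {
        // hypothetical local copies of the input and output files
        Path in = Paths.get("unicode.csv");
        Path out = Paths.get("unicode-copy.csv");

        // decode the file as UTF-8; invalid byte sequences throw MalformedInputException
        List<String> lines = Files.readAllLines(in, StandardCharsets.UTF_8);
        lines.forEach(System.out::println);

        // write the lines back as UTF-8 and compare the two files
        Files.write(out, lines, StandardCharsets.UTF_8);
    }
}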

Jacek Laskowski

.option('encoding', 'ISO-8859-1') worked for me. Acute, caret, and cedilla accents, among others, appeared correctly.


I had the same problem and solved it by setting the encoding to "UTF-8":

  input_df = (spark.read
              .option("sep", sep)
              .option("header", "true")
              .option("encoding", "UTF-8")
              .csv(my_path)
              )
Enrique Benito Casado