
I am using Apache POI to read a .docx file and, after some operations, write the data to a .csv file. The .docx file I am using is in French, but when I write the data to the .csv, some French characters get converted to garbage. For example, `Être un membre clé` comes out as `ÃŠtre un membre clÃ©`.

The code below is used to write the file:

        Path path = Paths.get(filePath);
        BufferedWriter bw = Files.newBufferedWriter(path);
        CSVWriter writer = new CSVWriter(bw);
        writer.writeAll(data);

which uses UTF-8 as the default charset.
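For reference, `Files.newBufferedWriter` also has an overload that takes an explicit `Charset`, so the intended encoding can be stated rather than relied on as a default. A minimal sketch (the file name `out.csv` and the sample string are just for illustration):

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExplicitCharsetWriter {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("out.csv");
        // Pass the charset explicitly instead of relying on the default.
        try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            bw.write("Être un membre clé");
        }
        // Reading the file back with the same charset round-trips the text.
        String back = Files.readString(path, StandardCharsets.UTF_8);
        System.out.println(back);
    }
}
```

Writing and reading with the same declared charset rules out the Java side as the source of the corruption; if the text still looks wrong in some viewer, the viewer's decoding is the problem.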

While debugging I have verified that the data is still intact just before it is written to the .csv, so it must be getting converted during the write. I have also set the default locale to Locale.FRENCH.

Have I missed something?

Hitesh Ghuge
    I think we need to also see how you are reading the docx file. I don't know much about that file format but it would be helpful to see that code also. – markspace Jun 21 '19 at 15:05
  • Using `FileInputStream`, but when I debug up to the write code, the data looks as it should. – Hitesh Ghuge Jun 21 '19 at 15:08
  • The data is, say, student info; it came from a client, say a school. – Hitesh Ghuge Jun 21 '19 at 15:22
  • Do you understand that when we say "where is `data` from?" that we need to *see the code* or we can't help you? Please post *all the code* that develops `data`. – markspace Jun 21 '19 at 15:25
  • You are creating a writer without explicitly specifying the character set. It is possibly writing it as UTF-8, while you are reading it as WIN1252 (or possibly, if you made this error early, you are reading UTF-8 as WIN1252). – Mark Rotteveel Jun 21 '19 at 16:02

2 Answers


I suspect it is Excel which reads the UTF-8 encoded CSV as ANSI. This happens when you simply open the CSV in Excel without using the Text Import Wizard: Excel always assumes ANSI unless there is a BOM at the beginning of the file. If you open the CSV in a text editor which supports Unicode, everything will be correct.

Example:

import java.io.BufferedWriter;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;

import java.util.Locale;
import java.util.List;
import java.util.ArrayList;

import com.opencsv.CSVWriter;

class DocxToCSV {

 public static void main(String[] args) throws Exception {

  Locale.setDefault(Locale.FRENCH);

  List<String[]> data = new ArrayList<String[]>();
  data.add(new String[]{"F1", "F2", "F3", "F4"});
  data.add(new String[]{"Être un membre clé", "Être clé", "membre clé"});
  data.add(new String[]{"Être", "un", "membre", "clé"});

  Path path = Paths.get("test.csv");
  BufferedWriter bw = Files.newBufferedWriter(path);

  //bw.write(0xFEFF); bw.flush(); // write a BOM to the file

  CSVWriter writer = new CSVWriter(bw, ';', '"', '"', "\r\n");
  writer.writeAll(data);
  writer.flush();
  writer.close();

 }
}

Now if you open the test.csv using a text editor which supports Unicode, all will be correct. But if you open the same file using Excel it looks like:

[Screenshot: Excel shows the CSV with the French characters garbled]

Now we do the same but having

bw.write(0xFEFF); bw.flush(); // write a BOM to the file

active.

This is the result when test.csv is simply opened by Excel:

[Screenshot: Excel now shows the French characters correctly]

Of course, the better approach is always to use Excel's Text Import Wizard.

See also Javascript export CSV encoding utf-8 issue for the same problem.

Axel Richter

`Être un membre clé` read as UTF-8 = `ÃŠtre un membre clÃ©` read as ANSI

Check the character encoding with which you are reading the final file.
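The mismatch can be reproduced directly in Java: encode the text to UTF-8 bytes, then decode those same bytes as windows-1252 (the typical "ANSI" code page), which is what a viewer expecting ANSI effectively does. A minimal sketch (class and method names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes with the windows-1252 ("ANSI") charset,
    // mimicking a viewer that opens a BOM-less UTF-8 file as ANSI.
    static String utf8ReadAsAnsi(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(utf8ReadAsAnsi("Être un membre clé")); // prints "ÃŠtre un membre clÃ©"
    }
}
```

Each accented character becomes two bytes in UTF-8, and each of those bytes is shown as its own windows-1252 character, which is exactly the doubled-character garbage seen in the question.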

cj rogers