
I am using Apache POI to read a .docx file and, after some operations, write the data to a .csv file. The .docx file I am using is in French, but when I write the data to the .csv, some French characters get converted to garbage. For example, `Être un membre clé` comes out as `ÃŠtre un membre clÃ©`.

The code below is used to write the file:

        Path path = Paths.get(filePath);
        BufferedWriter bw = Files.newBufferedWriter(path);
        CSVWriter writer = new CSVWriter(bw);
        writer.writeAll(data);

which uses UTF-8 as the default charset.
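For reference, `Files.newBufferedWriter` also has an overload that takes an explicit `Charset`, so the intended encoding can be stated rather than relied on as a default. A minimal sketch (the file name `out.csv` and the sample string are just for illustration):

```java
import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExplicitCharsetWriter {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("out.csv");
        // Pass the charset explicitly instead of relying on the default.
        try (BufferedWriter bw = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            bw.write("Être un membre clé");
        }
        // Reading the file back with the same charset round-trips the text.
        String back = Files.readString(path, StandardCharsets.UTF_8);
        System.out.println(back);
    }
}
```

Writing and reading with the same declared charset rules out the Java side as the source of the corruption; if the text still looks wrong in some viewer, the viewer's decoding is the problem.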

While debugging I have verified that the data is still intact just before it is written to the .csv, so it must be getting converted during the write. I have also set the default locale to Locale.FRENCH.

Have I missed something?

Hitesh Ghuge
    I think we need to also see how you are reading the docx file. I don't know much about that file format but it would be helpful to see that code also. – markspace Jun 21 '19 at 15:05
  • Using `FileInputStream`, but when I debug up to the write code, the data looks as it should. – Hitesh Ghuge Jun 21 '19 at 15:08
  • The data is, say, student info; it came from a client, say a school. – Hitesh Ghuge Jun 21 '19 at 15:22
  • Do you understand that when we say "where is `data` from?" that we need to *see the code* or we can't help you? Please post *all the code* that develops `data`. – markspace Jun 21 '19 at 15:25
  • You are creating a writer without explicitly specifying the character set. It is possibly writing it as UTF-8, while you are reading it as WIN1252 (or possibly, if you made this error early, you are reading UTF-8 as WIN1252). – Mark Rotteveel Jun 21 '19 at 16:02

2 Answers


I suspect it is Excel which reads the UTF-8 encoded CSV as ANSI. This happens when you simply open the CSV in Excel without using the Text Import Wizard: Excel always assumes ANSI unless there is a BOM at the beginning of the file. If you open the CSV in a text editor which supports Unicode, everything will be correct.

Example:

import java.io.BufferedWriter;

import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.Files;

import java.util.Locale;
import java.util.List;
import java.util.ArrayList;

import com.opencsv.CSVWriter;

class DocxToCSV {

 public static void main(String[] args) throws Exception {

  Locale.setDefault(Locale.FRENCH);

  List<String[]> data = new ArrayList<String[]>();
  data.add(new String[]{"F1", "F2", "F3", "F4"});
  data.add(new String[]{"Être un membre clé", "Être clé", "membre clé"});
  data.add(new String[]{"Être", "un", "membre", "clé"});

  Path path = Paths.get("test.csv");
  BufferedWriter bw = Files.newBufferedWriter(path);

  //bw.write(0xFEFF); bw.flush(); // write a BOM to the file

  CSVWriter writer = new CSVWriter(bw, ';', '"', '"', "\r\n");
  writer.writeAll(data);
  writer.flush();
  writer.close();

 }
}

Now if you open the test.csv using a text editor which supports Unicode, all will be correct. But if you open the same file using Excel it looks like:

[Screenshot: Excel shows the CSV with the French characters garbled]

Now we do the same but having

bw.write(0xFEFF); bw.flush(); // write a BOM to the file

active.

This is the result when test.csv is simply opened by Excel:

[Screenshot: Excel now shows the French characters correctly]

Of course, the better approach is always to use Excel's Text Import Wizard.

See also Javascript export CSV encoding utf-8 issue for the same problem.

Axel Richter

`Être un membre clé` read as UTF-8 = `ÃŠtre un membre clÃ©` read as ANSI

Check the character encoding with which you are reading the final file.
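The mismatch can be reproduced directly in Java: encode the text to UTF-8 bytes, then decode those same bytes as windows-1252 (the typical "ANSI" code page), which is what a viewer expecting ANSI effectively does. A minimal sketch (class and method names are illustrative):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode UTF-8 bytes with the windows-1252 ("ANSI") charset,
    // mimicking a viewer that opens a BOM-less UTF-8 file as ANSI.
    static String utf8ReadAsAnsi(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(utf8ReadAsAnsi("Être un membre clé")); // prints "ÃŠtre un membre clÃ©"
    }
}
```

Each accented character becomes two bytes in UTF-8, and each of those bytes is shown as its own windows-1252 character, which is exactly the doubled-character garbage seen in the question.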

cj rogers