How to write a CSV file in UTF-8 via Apache Commons CSV?

I am trying to generate a CSV file with the following code. Files.newBufferedWriter() encodes text as UTF-8 by default, but when I open the generated file in Excel, it shows garbled characters.

I create CSVPrinter like this:

CSVPrinter csvPrinter = new CSVPrinter(Files.newBufferedWriter(Paths.get(filePath)), CSVFormat.EXCEL);

Next I print the headers:

csvPrinter.printRecord(headers);

and then, in a loop, I print the values like this:

csvPrinter.printRecord("value1", "value2", ...);

I also tried uploading the file to an online CSV lint validator, and it reports that I am using ASCII-8BIT instead of UTF-8. What did I do wrong?

Vadim Kotov
Denis Stephanov
  • ASCII characters are encoded the same way in UTF8 as they are encoded in ASCII. Your code only uses ASCII characters, so there's no way to distinguish between ASCII and UTF8 when looking at the file. – JB Nizet Jul 19 '19 at 12:27
  • instead of `CSVFormat.EXCEL` try using `CSVFormat.RFC4180` – Ryuzaki L Jul 19 '19 at 12:28
  • @Deadpool doesn't help :/ – Denis Stephanov Jul 19 '19 at 12:32
  • something like this `CSVPrinter printer = new CSVPrinter(new PrintWriter("nlp.csv", "UTF-8"), CSVFormat.EXCEL.withDelimiter("|".charAt(0)));` @DenisStephanov – Ryuzaki L Jul 19 '19 at 12:33
  • @Deadpool it still doesn't work – Denis Stephanov Jul 19 '19 at 12:39
  • Is your file of `csv` type? @DenisStephanov – Ryuzaki L Jul 19 '19 at 12:40
  • @Deadpool yes, it is – Denis Stephanov Jul 19 '19 at 12:50
  • Just tested this code and it successfully created a `csv` in UTF-8, cannot reproduce error. – Nexevis Jul 19 '19 at 12:53
  • Is Excel expecting a byte order mark on the UTF-8 file? If nothing else, that `0xEF 0xBB 0xBF` at the start will signal that the text is in UTF-8 and not ASCII. – rossum Jul 19 '19 at 13:06
  • @rossum can you please provide a concrete solution for how to write this signal via my CSVPrinter? – Denis Stephanov Jul 19 '19 at 13:20
  • AFAIR Excel does not use UTF-8 by default; instead it expects ISO-8859-3. Thus you should create the BufferedWriter with Charset.forName("ISO-8859-3"). – M.F Jul 19 '19 at 13:27
  • @M.F I tried it and got this error: `UnmappableCharacterException: Input length = 1` – Denis Stephanov Jul 19 '19 at 13:31
  • Then you should rather change the charset during import in Excel. Apart from this, do you want "readability in Excel" or UTF-8? In case of UTF-8 you should be fine, as Nexevis pointed out. – M.F Jul 19 '19 at 13:45
  • @M.F reading in other tools like Notepad is fine; the problem is that it is not UTF-8, and this file is for an external system which requires UTF-8 – Denis Stephanov Jul 19 '19 at 13:47
  • This [answer](https://stackoverflow.com/a/56679480/2185783) shows how to write CSV with `BufferedWriter` and UTF-8. `BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8);` – maximus Nov 20 '21 at 00:04
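
JB Nizet's point about ASCII and UTF-8 can be checked directly. The following is a minimal sketch using only the JDK; the class name `AsciiVsUtf8` and the helper `sameBytes` are illustrative and do not come from the question. It suggests why a validator may report an ASCII-family encoding for a file that contains only ASCII characters, even though the writer was configured for UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiVsUtf8 {
    /** Returns true when the text encodes to identical bytes in US-ASCII and UTF-8. */
    static boolean sameBytes(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // ASCII-only content: the two encodings are byte-for-byte identical.
        System.out.println(sameBytes("value1,value2")); // prints "true"
        // Non-ASCII content: UTF-8 needs multi-byte sequences, so the bytes differ.
        System.out.println(sameBytes("caf\u00e9,na\u00efve")); // prints "false"
    }
}
```

If the generated file really does contain non-ASCII characters and still looks wrong in Excel, the encoding itself is probably fine and the issue is more likely how Excel detects it.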

1 Answer

Microsoft software tends to assume windows-12* or UTF-16LE charsets, unless the content starts with a byte order mark which the software will use to identify the charset. Try adding a byte order mark at the start of your file:

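try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(filePath))) {
    // Write the byte order mark (U+FEFF) before any CSV content.
    writer.write('\ufeff');

    // CSVPrinter's constructor also requires a CSVFormat; EXCEL matches the question.
    CSVPrinter csvPrinter = new CSVPrinter(writer, CSVFormat.EXCEL);

    //...
}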
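
To confirm that the byte order mark actually reaches the file, the first three bytes can be inspected. This is a hedged sketch that uses only the JDK so it runs without Apache Commons CSV on the classpath; `BomCsvDemo`, `writeWithBom`, and `startsWithUtf8Bom` are hypothetical names, and the plain `writer.write` calls stand in for what a `CSVPrinter` wrapped around the same writer would emit.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class BomCsvDemo {
    /** Writes a BOM followed by the given lines, encoded as UTF-8 (hypothetical helper). */
    static void writeWithBom(Path path, String... lines) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            writer.write('\ufeff'); // U+FEFF is encoded as EF BB BF in UTF-8
            for (String line : lines) {
                writer.write(line);
                writer.write("\r\n"); // CRLF line endings, assumed here for the CSV
            }
        }
    }

    /** Returns true if the file begins with the UTF-8 BOM bytes EF BB BF. */
    static boolean startsWithUtf8Bom(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("bom-demo", ".csv");
        writeWithBom(path, "Name,City", "Zo\u00eb,Krak\u00f3w");
        System.out.println(startsWithUtf8Bom(path)); // prints "true"
        Files.delete(path);
    }
}
```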
VGR
  • This may also be done as a header `CSVFormat.EXCEL.withHeader('\ufeff' + "Name", "Age")` so we can have `CSVPrinter` as part of the `try`. – Kenston Choi Dec 18 '20 at 04:31
  • Does this solution with the byte order mark still work on Ubuntu? Any idea? – Rezgui Baha Eddinne Mar 29 '21 at 17:42
  • @RezguiBahaEddinne This will work on any system. UTF-8 is universal. However, *reading* the file in Ubuntu will depend on the tools you use. In my experience, many editors are smart enough to recognize a BOM, but text processing tools often are not. – VGR Mar 29 '21 at 17:58