
Using OpenCSV to parse a UTF-8 document without a BOM results in the first column not being read. The same content encoded as UTF-8 with a BOM is parsed correctly.

I explicitly set the charset to UTF-8:

    fileInputStream = new FileInputStream(file);
    inputStreamReader = new InputStreamReader(fileInputStream, StandardCharsets.UTF_8);
    reader = new BufferedReader(inputStreamReader);
    HeaderColumnNameMappingStrategy<Bean> ms = new HeaderColumnNameMappingStrategy<Bean>();
    ms.setType(Bean.class);
    CsvToBean<Bean> csvToBean = new CsvToBeanBuilder<Bean>(reader).withType(Bean.class).withMappingStrategy(ms)
            .withSeparator(';').build();
    csvToBean.parse();

I've created a sample project where the issue can be reproduced: https://github.com/dajoropo/csv2beanSample

Running the unit test shows that the UTF-8 file without a BOM fails, while the one with a BOM works correctly.

The error occurs in the second assertion, because the first column is not read. The result is:

[Bean [a=null, b=second, c=third]]

Any hint?

Daniel Rodríguez
  • Which assertion fails: the number of parsed lines, or the value not being equal to "first" (and if so, what is it)? – Alexander Pavlov May 17 '19 at 18:53
  • Also, OpenCSV is open source. You have a small test that reproduces the problem, so just walk through it with a debugger and check what's wrong – Alexander Pavlov May 17 '19 at 18:56
  • @AlexanderPavlov the question is now updated specifying the error. I've tried debugging OpenCSV. I've seen that the first column is written wrongly in the fieldMap inside HeaderColumnNameMappingStrategy = [ ,A] instead of [A]. But I don't know why this happens. – Daniel Rodríguez May 20 '19 at 10:40

1 Answer

If I open the Bean class in your project and search for "B", I find one entry. If I search for "A", I find none :) That means you copy/pasted "A" together with the BOM into the Bean class. The BOM is not visible, but it is still taken into account.
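To make the invisible character concrete, here is a minimal stdlib-only sketch of why a header name pasted with a BOM does not match a plain "A". The class and method names (`BomHeaderDemo`, `stripBom`) are hypothetical and not part of the sample project or of OpenCSV.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the sample project.
class BomHeaderDemo {
    // U+FEFF is the byte order mark; in UTF-8 it is the bytes EF BB BF.
    static final char BOM = '\uFEFF';

    // Returns the text without a leading BOM, or unchanged if there is none.
    static String stripBom(String s) {
        return (!s.isEmpty() && s.charAt(0) == BOM) ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        // What a column name copied from a BOM-prefixed file can look like.
        String pasted = "\uFEFFA";

        System.out.println(pasted.equals("A"));                             // false
        System.out.println(pasted.length());                                // 2
        System.out.println(pasted.getBytes(StandardCharsets.UTF_8).length); // 4
        System.out.println(stripBom(pasted).equals("A"));                   // true
    }
}
```

Both strings look identical in most editors, which is why a text search for "A" in the source file is a quick way to spot the problem.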

If I fix "A", another test starts failing, but I think you can fix that using BOMInputStream.

See the question and answer "Byte order mark screws up file reading in Java".

It is a known problem. You can use Apache Commons IO's BOMInputStream to solve it.

I just tried

    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.6</version>
    </dependency>

and

        inputStreamReader = new InputStreamReader(new BOMInputStream(fileInputStream), StandardCharsets.UTF_8);

and fixing

    @CsvBindByName(column = "A")
    private String a;

so that "A" no longer carries the invisible BOM prefix. With both changes, both tests pass.
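For readers who cannot add the Commons IO dependency, the same idea can be sketched with the standard library only. This is not the `BOMInputStream` implementation, just a hand-rolled stand-in that handles the UTF-8 BOM alone; the names `Utf8BomSkipper`, `skipUtf8Bom` and `firstHeader` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a stdlib-only stand-in for BOMInputStream (UTF-8 only).
class Utf8BomSkipper {

    // Wraps the stream and consumes a leading EF BB BF if present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() may return fewer.
        while (n < head.length) {
            int r = pin.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: push the bytes back so the caller sees them again.
        if (!isBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }

    // Returns the first ';'-separated header name of the given CSV bytes.
    static String firstHeader(byte[] csvBytes) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(csvBytes)), StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            return line == null ? "" : line.split(";", -1)[0];
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFFA;B;C\nfirst;second;third\n".getBytes(StandardCharsets.UTF_8);
        byte[] withoutBom = "A;B;C\nfirst;second;third\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstHeader(withBom).equals("A"));    // true
        System.out.println(firstHeader(withoutBom).equals("A")); // true
    }
}
```

In the question's code, the stream returned by `skipUtf8Bom(fileInputStream)` would take the place of `new BOMInputStream(fileInputStream)`. The invisible prefix in the `@CsvBindByName` annotation still has to be removed for the header to match.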

Alexander Pavlov
  • Thanks for this input! It seems to be a great library, but it makes no difference for me. I've tried creating a BOMInputStream bomIn = new BOMInputStream(fileInputStream); and passing it to the InputStreamReader. I tried specifying the BOM type, setting it to exclude or include, and also calling bomIn.read() to skip the BOM... Nothing worked. One point that may not be clear: I have the issue when there is no BOM. Usually the problems appear when a BOM is present. This is why I haven't found any answer so far that works. – Daniel Rodríguez May 20 '19 at 14:54
  • I rewrote the answer. You have to fix the typo in your code and use `BOMInputStream` – Alexander Pavlov May 21 '19 at 18:18
  • Crazy issue... Thank you very much! – Daniel Rodríguez May 22 '19 at 07:13