I'm working on a small school exercise about Java text I/O. When I read a CSV file containing name prefixes (a Dutch thing) and surnames, I get a question mark at the beginning of the first line.

It's a small exercise in which I add my own code to an existing project of 3 small files to practice text I/O. Project code: https://github.com/Remzi1993/klantenBestand

public void vulNamenLijst() {
    // TODO: Read the file "resources/NamenlijstGroot.csv" and put each line (<name prefix>,<surname>)
    // in the ArrayList namenLijst.

    file = new File("resources/NamenlijstGroot.csv");

    try (
            Scanner scanner = new Scanner(file);
    ) {
        while (scanner.hasNext()) {
            String line = scanner.nextLine();
            String[] values = line.split(",");
            String namePrefix = values[0];
            String surname = values[1];
            namenLijst.add(namePrefix + " " + surname);
        }
    } catch (FileNotFoundException e) {
        System.err.println("Data file doesn't exist!");
    } catch (Exception e) {
        System.err.println("Something went wrong");
        e.printStackTrace();
    }
}

I'm sorry for the use of Dutch and English at the same time in the code. I try to write my own code in English, but this code exercise already existed and I only needed to add some code with the //TODO to practice Text I/O.

This is what I get: Screenshot

My CSV file: CSV file screenshot

Remzi Cavdar
  • I think that might be a BOM marker. See here https://stackoverflow.com/questions/4897876/reading-utf-8-bom-marker – funky Jul 13 '22 at 14:35
  • agreed with the first comment. The file was probably created with some Microsoft application (or formatted for one). Opening the file from the resources directory of the given GitHub repository, we can see that it starts with a BOM (sort of): [screenshot](https://i.stack.imgur.com/yMZX8.png) (most text editors are smart enough to handle that, that is, hide it) – user16320675 Jul 13 '22 at 14:49
  • you can check whether the first character is `0xFEFF` and, if so, ignore/remove it – user16320675 Jul 13 '22 at 14:56

3 Answers

@funky is correct. Your file starts with a UTF-8 BOM.

output of xxd:

00000000: efbb bf64 652c 4a6f 6e67 0a2c 4a61 6e73  ...de,Jong.,Jans
00000010: 656e 0a64 652c 5672 6965 730a 7661 6e20  en.de,Vries.van 

The first three bytes are ef bb bf, which is the UTF-8 encoding of the byte order mark (U+FEFF).
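
If xxd isn't at hand, the same check can be sketched in Java. This is only an illustration: the class and method names (BomCheck, startsWithUtf8Bom, fileStartsWithUtf8Bom) are made up for this example, and it assumes Java 11 or later for InputStream.readNBytes(int).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the exercise project.
class BomCheck {
    // True if the given bytes begin with the UTF-8 BOM (EF BB BF).
    static boolean startsWithUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    // Reads at most the first three bytes of the file and checks them.
    static boolean fileStartsWithUtf8Bom(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            return startsWithUtf8Bom(in.readNBytes(3));
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the path from the question; pass another path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "resources/NamenlijstGroot.csv");
        System.out.println(fileStartsWithUtf8Bom(path) ? "UTF-8 BOM found" : "no UTF-8 BOM");
    }
}
```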

Rob Audenaerde
  • Thank you so much! Could you please show me a short and simple solution how I could detect and remove the BOM if present? This would help a lot of Java developers :) – Remzi Cavdar Jul 13 '22 at 14:56
  • That's true haha :) – Remzi Cavdar Jul 13 '22 at 14:58
  • Thank you so much! I have provided a solution as an answer here :) – Remzi Cavdar Jul 13 '22 at 21:17
  • @RobAudenaerde I'm the author of this very exercise and can assure you that the BOM encoding is not part of it. Most likely, the file was opened in a text editor which tried to be smart and insert it for you :P – Lennard Fonteijn Jul 14 '22 at 10:09
  • @LennardFonteijn thanks for chipping in. I meant it as a jest, there are many ways to fix the BOM issue as can be found here on SO as well :) – Rob Audenaerde Jul 21 '22 at 08:12

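To mitigate the BOM using a 'standard' component, you can use BOMInputStream from Apache Commons IO. Note that BOMs come in multiple flavours (see here for more details). As far as I know, BOMInputStream looks only for the UTF-8 BOM by default and has to be told about the other flavours explicitly.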

If you have a sizeable project, you may find that BOMInputStream is already available in it via commons-io.

Scanner accepts an InputStream (see here), so the wrapped stream can be passed straight to it.
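
For readers who can't add commons-io, here is a minimal JDK-only sketch of the same idea: wrap the stream and drop the BOM before Scanner sees it. This is not BOMInputStream itself. The names BomSkipper, skipUtf8Bom and readLines are hypothetical, it handles only the UTF-8 BOM, and it assumes Java 10 or later for the Scanner(InputStream, Charset) constructor.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper that mimics, for UTF-8 only, what a BOM-stripping stream does.
class BomSkipper {
    // Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if there is one.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pushback.readNBytes(head, 0, 3); // fewer than 3 only at end of stream
        boolean isBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return pushback;
    }

    // Reads all lines as UTF-8, ignoring a leading BOM.
    static List<String> readLines(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(skipUtf8Bom(in), StandardCharsets.UTF_8)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // "\uFEFF" encodes to EF BB BF in UTF-8, so this input starts with a BOM.
        byte[] data = "\uFEFFde,Jong\n,Jansen\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readLines(new ByteArrayInputStream(data)));
    }
}
```

In the exercise, the ByteArrayInputStream would be replaced by a FileInputStream on the CSV file.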

Brian Agnew

I found an easy solution:

final String UTF8_BOM = "\uFEFF";

if (line.startsWith(UTF8_BOM)) {
    line = line.substring(1);
}

A simple workable example:

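// Needs: java.io.File, java.io.FileNotFoundException,
//        java.nio.charset.StandardCharsets, java.util.Scanner
File file = new File("resources/NamenlijstGroot.csv");

// After UTF-8 decoding, the BOM shows up as the single character U+FEFF.
final String UTF8_BOM = "\uFEFF";

try (Scanner scanner = new Scanner(file, StandardCharsets.UTF_8)) {
    boolean firstLine = true;

    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();

        // A BOM can only appear at the very start of the file.
        if (firstLine && line.startsWith(UTF8_BOM)) {
            line = line.substring(1);
        }
        firstLine = false;
        line = line.strip();

        String[] values = line.split(",");
        String namePrefix = values[0];
        String surname = values[1];
        namenLijst.add(namePrefix + " " + surname);
    }
} catch (FileNotFoundException e) {
    System.err.println("Data file doesn't exist!");
} catch (Exception e) {
    System.err.println("Something went wrong");
    e.printStackTrace();
}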
Remzi Cavdar
  • You *should* explicitly set the encoding on the `Scanner` btw, now it uses the System default (which *might* be utf-8). – Rob Audenaerde Jul 13 '22 at 15:20
  • Like this: `new Scanner(file, StandardCharsets.UTF_8);` – Remzi Cavdar Jul 13 '22 at 15:29
  • The byte order mark only comes at the start of the file, note. And can come in multiple flavours – Brian Agnew Jul 13 '22 at 15:54
  • little *trick*: `Scanner scanner = new Scanner(f).useDelimiter("^\uFEFF|\\R");` and use `scanner.next()` to read next line - no need to check for `\uFEFF` – user16320675 Jul 13 '22 at 15:55
  • Fair point. But note it still doesn't handle the different BOM formats – Brian Agnew Jul 13 '22 at 16:01
  • I mean in general. I don't want people to stumble across this and not realise it's UTF-8 specific – Brian Agnew Jul 13 '22 at 16:11
  • @Brian The note/warning is valid for future readers, although I expect that the BOM should be removed from UTF-16 formatted files by the `Scanner` (quick test: it is, if using the `UTF-16` charset) || it is only(?) MS that adds a BOM to a UTF-8 formatted file – user16320675 Jul 13 '22 at 16:18
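
The difference described in the last comment can be reproduced without a file. This is a minimal sketch with made-up names (BomDecodingDemo, decode, keepsBom). It decodes with new String(bytes, charset) rather than with a Scanner, on the assumption that both go through the same charset decoders.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares how the JDK's UTF-8 and UTF-16 decoders treat a BOM.
class BomDecodingDemo {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // True if the decoded text still begins with the BOM character U+FEFF.
    static boolean keepsBom(String decoded) {
        return decoded.startsWith("\uFEFF");
    }

    public static void main(String[] args) {
        // "de" preceded by a UTF-8 BOM (EF BB BF).
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'd', (byte) 'e'};
        // "de" in big-endian UTF-16, preceded by the UTF-16 BOM (FE FF).
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0, (byte) 'd', 0, (byte) 'e'};

        System.out.println("UTF-8 keeps BOM:  " + keepsBom(decode(utf8, StandardCharsets.UTF_8)));
        System.out.println("UTF-16 keeps BOM: " + keepsBom(decode(utf16, StandardCharsets.UTF_16)));
    }
}
```

The UTF-16 charset uses the BOM to pick the byte order and drops it, while the UTF-8 decoder passes it through as U+FEFF. That is why the manual check is only needed in the UTF-8 case.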