Java String.startsWith() "seems" not working for the first line of a text file

Question

I have a text file like the one below, and I want to parse the values out of it.

#title キミと☆Are You Ready？
#artist トライクロニカ
#mobile deresimu
#easy 0
#normal 22
#hard 27
#tag SHOW BY ROCK!!
#preset all

I used this code to parse it.

File infoFile = new File(dir, "info.txt");
//parse info.txt
String songName="?";
String artist = "?";
int difficulties[] = new int[5];

try {
    BufferedReader br = new BufferedReader(new FileReader(infoFile));
    String line = br.readLine();
    while (line != null) {
        Log.v(TAG, "line=" + line);
        //I hate BOM!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        /*
         * <a href="http://www.faqs.org/rfcs/rfc3629.html">RFC 3629 - UTF-8, a transformation format of ISO 10646</a>
         *
         * <p>The
         * <a href="http://www.unicode.org/unicode/faq/utf_bom.html">Unicode FAQ</a>
         * defines 5 types of BOMs:<ul>
         * <li><pre>00 00 FE FF  = UTF-32, big-endian</pre></li>
         * <li><pre>FF FE 00 00  = UTF-32, little-endian</pre></li>
         * <li><pre>FE FF        = UTF-16, big-endian</pre></li>
         * <li><pre>FF FE        = UTF-16, little-endian</pre></li>
         * <li><pre>EF BB BF     = UTF-8</pre></li>
         * </ul></p>
         *
         * https://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java
         */
        line=line.replace("\u00EF\u00BB\u00BF", "");
        line=line.replace("\u0000 \u0000 \u00FE \u00FF","");
        line=line.replace("\u00FF \u00FE \u0000 \u0000","");
        line=line.replace("\u00FE \u00FF","");
        line=line.replace("\u00FF \u00FE","");
        if (line.startsWith("#title")) {
            Log.v(TAG, "startswith");
            line = line.replace("#title ", "").trim();
            songName = line;
        } else if (line.startsWith("#artist")) {
            line = line.replace("#artist ", "").trim();
            artist = line;
        } else if (line.startsWith("#easy")) {
            difficulties[0] = Integer.parseInt(line.replace("#easy ", "").trim());

        } else if (line.startsWith("#normal")) {
            difficulties[1] = Integer.parseInt(line.replace("#normal ", "").trim());

        } else if (line.startsWith("#hard")) {
            difficulties[2] = Integer.parseInt(line.replace("#hard ", "").trim());
        } else if (line.startsWith("#master")) {
            difficulties[3] = Integer.parseInt(line.replace("#master ", "").trim());
        } else if (line.startsWith("#apex")) {
            difficulties[4] = Integer.parseInt(line.replace("#apex ", "").trim());
        }
        line = br.readLine();
    }
} catch (IOException | NumberFormatException e) {
    throw new RuntimeException(e);
}
//info.txt parse done.
Log.v(TAG, "Info.txt parse done.");
Log.v(TAG, "Song name=" + songName);
Log.v(TAG, "Difficulties=" + Arrays.toString(difficulties));
Log.v(TAG, "Artist=" + artist);
Log.v(TAG, "Folder=" + dir.getName());

Parsing works for every line except the first. The condition line.startsWith("#title") never seems to be true for this file. When I change startsWith to contains, it works.

At first I thought it was a BOM problem, so I added the five lines that remove BOM sequences, but that did not help. The variable songName is still "?" when I use startsWith on the first line.

Any clues why this code cannot match the #title? Thanks.

Logcat output:

2019-03-10 23:00:22.872 23600-23600/sma.rhythmtapper V/NoteFile: line=#title キミと☆Are You Ready？
2019-03-10 23:00:22.872 23600-23600/sma.rhythmtapper V/NoteFile: line=#artist トライクロニカ
2019-03-10 23:00:22.872 23600-23600/sma.rhythmtapper V/NoteFile: line=#mobile deresimu
2019-03-10 23:00:22.873 23600-23600/sma.rhythmtapper V/NoteFile: line=#easy 0
2019-03-10 23:00:22.873 23600-23600/sma.rhythmtapper V/NoteFile: line=#normal 22
2019-03-10 23:00:22.873 23600-23600/sma.rhythmtapper V/NoteFile: line=#hard 27
2019-03-10 23:00:22.874 23600-23600/sma.rhythmtapper V/NoteFile: line=#tag SHOW BY ROCK!!
2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: line=#preset all
2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Info.txt parse done.
2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Song name=?
2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Difficulties=[0, 22, 27, 0, 0]
2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Artist=トライクロニカ
2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Folder=キミと☆Are You Ready？

EDIT

I located the problem by printing the byte sequence to logcat. It said:

"#title キミと☆Are You Ready？" -> [-17, -69, -65, 35, 116, 105, 116, 108, 101, 32, -29, -126, -83, -29, -125, -97, -29, -127, -88, -30, -104, -122, 65, 114, 101, 32, 89, 111, 117, 32, 82, 101, 97, 100, 121, -17, -68, -97]

"#title" -> [35, 116, 105, 116, 108, 101]
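A dump like the one above can be produced roughly as follows. This is only a sketch: the class and method names are made up for illustration, it assumes the line was decoded as UTF-8, and it prints to standard output where the app would call Log.v.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteDump {
    // Returns the UTF-8 bytes of a string in the same signed-decimal form as the log above.
    static String dump(String s) {
        return Arrays.toString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The first string starts with U+FEFF, which is invisible when the line is printed.
        System.out.println(dump("\uFEFF#title")); // [-17, -69, -65, 35, 116, 105, 116, 108, 101]
        System.out.println(dump("#title"));       // [35, 116, 105, 116, 108, 101]
    }
}
```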

So I need to remove the leading -17, -69, -65 (the bytes EF BB BF, i.e. the UTF-8 BOM) from the line variable. How can I do that without using an external library?

Cannot reproduce. In Clojure, `(.startsWith "#title キミと☆Are You Ready？" "#title")` is `true`. You'll need to reduce this down to a [mcve]. — Carcigenicate, Mar 10 '19 at 14:20
It works for me as well. You should use debugger for this small part of code. — Adam Macierzyński, Mar 10 '19 at 14:23
@Boken 2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Info.txt parse done. 2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Song name=キミと☆Are You Ready？ 2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Difficulties=[0, 22, 27, 0, 0] 2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Artist=トライクロニカ 2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Folder=キミと☆Are You Ready — KYHSGeekCode, Mar 10 '19 at 14:24
So it's working. I can see `Song name=キミと☆Are You Ready？` in your output. — Adam Macierzyński, Mar 10 '19 at 14:25
That code won't remove BOMs. `\u00EF` is *not* a single byte containing 0xEF. And you seem to have added space characters into some of your strings. — rici, Mar 10 '19 at 14:31
@rici thank you for pointing out the spaces, but how can I remove a single byte? — KYHSGeekCode, Mar 10 '19 at 14:33
It does look like a BOM problem. Logging the bytes gives: 2019-03-10 23:37:31.968 28182-28182/sma.rhythmtapper V/NoteFile: [-17, -69, -65, 35, 116, 105, 116, 108, 101, 32, -29, -126, -83, -29, -125, -97, -29, -127, -88, -30, -104, -122, 65, 114, 101, 32, 89, 111, 117, 32, 82, 101, 97, 100, 121, -17, -68, -97] 2019-03-10 23:37:31.968 28182-28182/sma.rhythmtapper V/NoteFile: [35, 116, 105, 116, 108, 101] for "#title キミと☆Are You Ready？" and "#title", respectively. — KYHSGeekCode, Mar 10 '19 at 14:46
@rici Do you know how to remove BOM without using any external library? — KYHSGeekCode, Mar 10 '19 at 14:46

score 1 · Answer 1 · answered Mar 10 '19 at 15:02

The suspicion that a BOM caused the problem was correct.

I changed the BOM-removal code to this:

line=line.replace("\uEFBB\u00BF", "");
line=line.replace("\u0000\uFEFF","");
line=line.replace("\uFFFE\u0000","");
line=line.replace("\uFEFF","");
line=line.replace("\uFFFE","");

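Two things to watch out for:

- Stray whitespace inside the string literals: "\u00FE \u00FF" contains a literal space, so it matches nothing in the file.
- \u00EF is the character U+00EF, not the byte 0xEF. By the time readLine() returns, the UTF-8 BOM bytes EF BB BF have already been decoded into the single character U+FEFF, so that is the character to remove.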
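For anyone who only needs to handle UTF-8 files, a narrower variant is to strip a single leading U+FEFF from the first line instead of replacing it everywhere. The following is a minimal sketch under that assumption; BomStripper, stripLeadingBom and readTitle are names invented here, and a StringReader stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class BomStripper {
    // U+FEFF is what the UTF-8 BOM bytes EF BB BF become after decoding.
    private static final char BOM = '\uFEFF';

    // Removes one leading BOM character if present; any other input is returned unchanged.
    static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == BOM) {
            return line.substring(1);
        }
        return line;
    }

    // Returns the value of the "#title" line, or null if there is none.
    static String readTitle(BufferedReader br) throws IOException {
        // Only the first line of a file can carry a BOM, so strip it once here.
        String line = stripLeadingBom(br.readLine());
        while (line != null) {
            if (line.startsWith("#title")) {
                return line.substring("#title".length()).trim();
            }
            line = br.readLine();
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Sample input with a leading BOM, standing in for info.txt.
        String sample = "\uFEFF#title Example\n#artist Someone\n";
        System.out.println(readTitle(new BufferedReader(new StringReader(sample)))); // Example
    }
}
```

With a real file, opening it as new InputStreamReader(new FileInputStream(infoFile), StandardCharsets.UTF_8) rather than new FileReader(infoFile) makes the decoding explicit instead of relying on the default charset.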

Thank you everybody who tried to help me, and hope that others who may have the same issue get help from this post.

answered Mar 10 '19 at 15:02

KYHSGeekCode


1) If you know that the input is encoded in UTF-8, you only need to remove the UTF-8 BOM. 2) But if you _don't_ know what the encoding is, removing all kinds of BOMs also erases the necessary information about the encoding! – Mr Lister Mar 11 '19 at 11:06
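To illustrate the second point: when the encoding is unknown, the BOM can be inspected at the byte level before any decoding happens. The sketch below assumes the first few bytes of the file are already in a byte array; BomSniffer and detect are hypothetical names, and the UTF-32 charsets are looked up by name because they are not among the charsets every JVM is required to support, so Charset.forName may throw on some platforms.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    // Guesses the charset from the leading bytes of a file; returns fallback if no BOM is found.
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        // UTF-32 must be checked before UTF-16: FF FE 00 00 also begins with FF FE.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return Charset.forName("UTF-32BE");
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return Charset.forName("UTF-32LE");
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

The detected charset would then be passed to the reader, and the decoded U+FEFF at the start of the text can be dropped afterwards without losing the encoding information.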
