
I have a text file like this, and I want to parse information from it.

#title キミと☆Are You Ready?
#artist トライクロニカ
#mobile deresimu
#easy 0
#normal 22
#hard 27
#tag SHOW BY ROCK!!
#preset all

I used this code to parse it.

File infoFile = new File(dir, "info.txt");
//parse info.txt
String songName="?";
String artist = "?";
int difficulties[] = new int[5];

try {
    BufferedReader br = new BufferedReader(new FileReader(infoFile));
    String line = br.readLine();
    while (line != null) {
        Log.v(TAG, "line=" + line);
        //I hate BOM!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
        /*
         * RFC 3629 - UTF-8, a transformation format of ISO 10646:
         * http://www.faqs.org/rfcs/rfc3629.html
         *
         * The Unicode FAQ (http://www.unicode.org/unicode/faq/utf_bom.html)
         * defines 5 types of BOMs:
         *   00 00 FE FF  = UTF-32, big-endian
         *   FF FE 00 00  = UTF-32, little-endian
         *   FE FF        = UTF-16, big-endian
         *   FF FE        = UTF-16, little-endian
         *   EF BB BF     = UTF-8
         *
         * https://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java
         */
        line=line.replace("\u00EF\u00BB\u00BF", "");
        line=line.replace("\u0000 \u0000 \u00FE \u00FF","");
        line=line.replace("\u00FF \u00FE \u0000 \u0000","");
        line=line.replace("\u00FE \u00FF","");
        line=line.replace("\u00FF \u00FE","");
        if (line.startsWith("#title")) {
            Log.v(TAG, "startswith");
            line = line.replace("#title ", "").trim();
            songName = line;
        } else if (line.startsWith("#artist")) {
            line = line.replace("#artist ", "").trim();
            artist = line;
        } else if (line.startsWith("#easy")) {
            difficulties[0] = Integer.parseInt(line.replace("#easy ", "").trim());

        } else if (line.startsWith("#normal")) {
            difficulties[1] = Integer.parseInt(line.replace("#normal ", "").trim());

        } else if (line.startsWith("#hard")) {
            difficulties[2] = Integer.parseInt(line.replace("#hard ", "").trim());
        } else if (line.startsWith("#master")) {
            difficulties[3] = Integer.parseInt(line.replace("#master ", "").trim());
        } else if (line.startsWith("#apex")) {
            difficulties[4] = Integer.parseInt(line.replace("#apex ", "").trim());
        }
        line = br.readLine();
    }
} catch (IOException | NumberFormatException e) {
    throw new RuntimeException(e);
}
//info.txt parse done.
Log.v(TAG, "Info.txt parse done.");
Log.v(TAG, "Song name=" + songName);
Log.v(TAG, "Difficulties=" + Arrays.toString(difficulties));
Log.v(TAG, "Artist=" + artist);
Log.v(TAG, "Folder=" + dir.getName());

Every other line parses fine, but the first one does not: if (line.startsWith("#title")) { never seems to be true for the given text file. When I change startsWith to contains, it works.

At first I thought it was a BOM problem, so I added the five lines that remove BOM sequences. However, that didn't work: songName is always "?" when I use startsWith for the first line.

Any clues why this code cannot match the #title? Thanks.

Logcat output:

2019-03-10 23:00:22.872 23600-23600/sma.rhythmtapper V/NoteFile: line=#title キミと☆Are You Ready?
2019-03-10 23:00:22.872 23600-23600/sma.rhythmtapper V/NoteFile: line=#artist トライクロニカ
2019-03-10 23:00:22.872 23600-23600/sma.rhythmtapper V/NoteFile: line=#mobile deresimu
2019-03-10 23:00:22.873 23600-23600/sma.rhythmtapper V/NoteFile: line=#easy 0
2019-03-10 23:00:22.873 23600-23600/sma.rhythmtapper V/NoteFile: line=#normal 22
2019-03-10 23:00:22.873 23600-23600/sma.rhythmtapper V/NoteFile: line=#hard 27
2019-03-10 23:00:22.874 23600-23600/sma.rhythmtapper V/NoteFile: line=#tag SHOW BY ROCK!!
2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: line=#preset all
2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Info.txt parse done.
2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Song name=?
2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Difficulties=[0, 22, 27, 0, 0]
2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Artist=トライクロニカ
2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Folder=キミと☆Are You Ready?

EDIT

I located the problem by printing the byte sequence to logcat. It said:

"#title キミと☆Are You Ready?" -> [-17, -69, -65, 35, 116, 105, 116, 108, 101, 32, -29, -126, -83, -29, -125, -97, -29, -127, -88, -30, -104, -122, 65, 114, 101, 32, 89, 111, 117, 32, 82, 101, 97, 100, 121, -17, -68, -97]

"#title" -> [35, 116, 105, 116, 108, 101]

So I need to remove -17, -69, -65 from the line variable. How can I achieve the goal without using an external library?
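
For reference, a byte dump like the one above can be produced with plain JDK calls. This is a minimal sketch, assuming the line is encoded back to UTF-8 for inspection; the class and helper names (ByteDump, utf8Bytes) are made up for illustration, and on Android the result would go to Log.v instead of System.out.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteDump {
    // Returns the UTF-8 bytes of a string as a printable list.
    public static String utf8Bytes(String s) {
        return Arrays.toString(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // A line that still carries the decoded BOM character U+FEFF.
        System.out.println(utf8Bytes("\uFEFF#title"));
        // prints [-17, -69, -65, 35, 116, 105, 116, 108, 101]

        // The prefix the parser is looking for.
        System.out.println(utf8Bytes("#title"));
        // prints [35, 116, 105, 116, 108, 101]
    }
}
```

The three leading values -17, -69, -65 are the signed-byte form of EF BB BF, the UTF-8 BOM.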

KYHSGeekCode
  • works fine for me – Ruslan Mar 10 '19 at 14:19
  • Cannot reproduce. In Clojure, `(.startsWith "#title キミと☆Are You Ready?" "#title")` is `true`. You'll need to reduce this down to a [mcve]. – Carcigenicate Mar 10 '19 at 14:20
  • What is expected output? – Boken Mar 10 '19 at 14:22
  • It works for me as well. You should use debugger for this small part of code. – Adam Macierzyński Mar 10 '19 at 14:23
  • @Boken 2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Info.txt parse done. 2019-03-10 23:00:22.876 23600-23600/sma.rhythmtapper V/NoteFile: Song name=キミと☆Are You Ready? 2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Difficulties=[0, 22, 27, 0, 0] 2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Artist=トライクロニカ 2019-03-10 23:00:22.877 23600-23600/sma.rhythmtapper V/NoteFile: Folder=キミと☆Are You Ready – KYHSGeekCode Mar 10 '19 at 14:24
  • So it's working. I can see `Song name=キミと☆Are You Ready?` in your output. – Adam Macierzyński Mar 10 '19 at 14:25
  • @KYHSGeekCode it works. I'm receiving such output. – Boken Mar 10 '19 at 14:27
  • That code won't remove BOMs. `\u00EF` is *not* a single byte containing 0xEF. And you seem to have added space characters into some of your strings. – rici Mar 10 '19 at 14:31
  • @rici thank you for pointing that(space) me out, but how can I remove a single byte? – KYHSGeekCode Mar 10 '19 at 14:33
  • Seems like the BOM problem, as I can see line=#title キミと☆Are You Ready? 2019-03-10 23:37:31.968 28182-28182/sma.rhythmtapper V/NoteFile: [-17, -69, -65, 35, 116, 105, 116, 108, 101, 32, -29, -126, -83, -29, -125, -97, -29, -127, -88, -30, -104, -122, 65, 114, 101, 32, 89, 111, 117, 32, 82, 101, 97, 100, 121, -17, -68, -97] 2019-03-10 23:37:31.968 28182-28182/sma.rhythmtapper V/NoteFile: [35, 116, 105, 116, 108, 101] for "#title キミと☆Are You Ready?" and "#title', respectively. – KYHSGeekCode Mar 10 '19 at 14:46
  • @rici Do you know how to remove BOM without using any external library? – KYHSGeekCode Mar 10 '19 at 14:46

1 Answer


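The suspicion that the BOM caused the problem was correct. Once the file is decoded as UTF-8, the BOM bytes EF BB BF arrive as the single character U+FEFF at the start of the first line, so that line begins with U+FEFF rather than with '#' and startsWith("#title") returns false.

I changed the BOM-removing code to this: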

// The file is decoded as UTF-8, so the BOM bytes EF BB BF reach this code
// as the single character U+FEFF. Removing that character is what fixes it.
line=line.replace("\uFEFF","");

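Be careful about:

  • stray whitespace inside the string literals ("\u00FE \u00FF" contains a space character and will never match a BOM)
  • \u00EF != byte 0xEF: \u00EF is the character U+00EF, not a raw byte, and String.replace works on characters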
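
The difference in the second point can be checked directly. This is a small sketch, assuming the text was decoded as UTF-8; the class and helper names (BomCheck, stripBom) are hypothetical and only there for illustration. Unlike replace, the helper removes U+FEFF only when it is the first character of the line.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomCheck {
    // Removes one leading U+FEFF, which is what the UTF-8 BOM bytes
    // EF BB BF become once the file has been decoded as UTF-8.
    public static String stripBom(String line) {
        if (line != null && line.startsWith("\uFEFF")) {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        // Three separate characters (U+00EF, U+00BB, U+00BF): six UTF-8 bytes.
        byte[] wrong = "\u00EF\u00BB\u00BF".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(wrong));
        // prints [-61, -81, -62, -69, -62, -65]

        // One character, U+FEFF: exactly the BOM bytes EF BB BF.
        byte[] right = "\uFEFF".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(right));
        // prints [-17, -69, -65]

        System.out.println(stripBom("\uFEFF#title x").startsWith("#title"));
        // prints true
    }
}
```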

Thank you to everybody who tried to help, and I hope others with the same issue find this post useful.

KYHSGeekCode
  • 1) If you know that the input is encoded in UTF-8, you only need to remove the UTF-8 BOM. 2) But if you _don't_ know what the encoding is, removing all kinds of BOMs also erases the necessary information about the encoding! – Mr Lister Mar 11 '19 at 11:06
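
Following point 1 of the comment above, one possible approach is to decode the file as UTF-8 explicitly and skip a leading U+FEFF once, at the reader level, instead of on every line. This is only a sketch under that assumption; Utf8BomReader, open, and firstLine are illustrative names, not part of any library.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8BomReader {
    // Wraps the stream in a UTF-8 reader and skips one leading U+FEFF, if any.
    public static BufferedReader open(InputStream in) throws IOException {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        br.mark(1);                  // remember the start of the stream
        if (br.read() != '\uFEFF') {
            br.reset();              // first char was not a BOM: rewind
        }
        return br;
    }

    // Demo helper: returns the first line of UTF-8 encoded data.
    public static String firstLine(byte[] utf8Data) {
        try (BufferedReader br = open(new ByteArrayInputStream(utf8Data))) {
            return br.readLine();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFF#title x\n#easy 0\n".getBytes(StandardCharsets.UTF_8);
        byte[] noBom = "#title x\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(withBom).startsWith("#title")); // prints true
        System.out.println(firstLine(noBom).startsWith("#title"));   // prints true
    }
}
```

In the code from the question, this would mean replacing new BufferedReader(new FileReader(infoFile)) with open(new FileInputStream(infoFile)), after which the per-line replace calls are no longer needed.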