11

I open Notepad (Windows) and write

Some lines with special characters
Special: Žđšćč

and go to Save As... "someFile.txt" with Encoding set to UTF-8.

In Java I have

FileInputStream fis = new FileInputStream(new File("someFile.txt"));
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader in = new BufferedReader(isr);

String line;
while((line = in.readLine()) != null) {                         
    printLine(line);
}
in.close();

But I get question marks and similar "special" characters. Why?

EDIT: I have this input (one line in .txt file)

665,Žđšćč

and this code

FileInputStream fis = new FileInputStream(new File(fileName));
InputStreamReader isr = new InputStreamReader(fis, "UTF-8");
BufferedReader in = new BufferedReader(isr);

String line;
while((line = in.readLine()) != null) {
    Toast.makeText(mContext, line, Toast.LENGTH_LONG).show();

    Pattern p = Pattern.compile(",");
    String[] article = p.split(line);

    Toast.makeText(mContext, article[0], Toast.LENGTH_LONG).show();
    Toast.makeText(mContext, Integer.parseInt(article[0]), Toast.LENGTH_LONG).show();
}
in.close();

The Toast output is fine (for those unfamiliar with Android, a Toast is just a pop-up that shows some text on screen). The console shows "weird characters", probably because of the encoding of the console window. But parsing the integer fails and the console reports an error, even though the Toast output looks fine. What is the problem?

It seems the String contains some "weird" characters that Toast doesn't render, but when I try to parse it, it crashes. Suggestions?

If I save the file as ANSI in Notepad, integer parsing works and there are no weird characters like those in the picture above, but then of course my special characters don't work.

merours
svenkapudija
  • What's in that printLine(line) function? – Will Jan 04 '11 at 19:56
  • `while((line = in.readLine()) != null)` - Does Java even let you do that? I thought in Java, assignments weren't considered expressions... – Eric Jan 04 '11 at 19:59
  • @Will printLine just prints it to my Debugger (Eclipse) - in this case second line becomes "01-04 20:01:23.394: VERBOSE/line(32246): Special: ŽÄÅ¡ÄÄ" – svenkapudija Jan 04 '11 at 20:02
  • while((line = in.readLine()) != null) - Yes you can do this Eric. The condition that the while loop is evaluating is (A != null) where A is the result of reading one line from the in stream. – xagyg Jan 04 '11 at 23:55
  • Actually my problem is a little different: my file name is Žđšćč and I get an error on FileInputStream fis = new FileInputStream(new File("Žđšćč.txt")); Please help. – Bhanu Sharma Jul 01 '14 at 07:13

6 Answers

17

It's the output console which doesn't support those characters. Since you're using Eclipse, you need to ensure that it's configured to use UTF-8 for this. You can do this by Window > Preferences > General > Workspace > Text File Encoding > set to UTF-8.


Update: as per the updated question and the comments, the UTF-8 BOM is apparently the culprit. Notepad by default adds the UTF-8 BOM on save. It looks like the JRE on your HTC doesn't swallow it. You may want to consider using the UnicodeReader example as outlined in this answer instead of InputStreamReader in your code. It autodetects and skips the BOM.

FileInputStream fis = new FileInputStream(new File(fileName));
UnicodeReader ur = new UnicodeReader(fis, "UTF-8");
BufferedReader in = new BufferedReader(ur);
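
UnicodeReader is not a JDK class, so it has to be copied into the project. If that is not an option, a minimal sketch of the same idea with plain java.io could look like the following. BomSkippingReader and open are hypothetical names chosen for illustration, StandardCharsets assumes Java 7 or newer, and unlike UnicodeReader this only handles the UTF-8 case (no autodetection of other encodings).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not the UnicodeReader class mentioned above.
final class BomSkippingReader {
    // The UTF-8 bytes EF BB BF decode to this single character.
    private static final int BOM = 0xFEFF;

    /** Opens a UTF-8 reader on the stream and drops a leading U+FEFF if present. */
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                 // remember the start of the stream
        if (reader.read() != BOM) {
            reader.reset();             // first char was real data (or EOF): rewind
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Same content as the question's file, with a BOM in front.
        // Unicode escapes keep this source file independent of its own encoding.
        String content = "\uFEFF665,\u017D\u0111\u0161\u0107\u010D";
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = open(new ByteArrayInputStream(bytes))) {
            String[] article = reader.readLine().split(",");
            System.out.println(Integer.parseInt(article[0])); // prints 665
        }
    }
}
```

Without the BOM check, article[0] would start with the invisible U+FEFF character, which is why Integer.parseInt throws even though the text looks fine on screen.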

Unrelated to the actual problem, it's good practice to close resources in a finally block so that they are closed even when an exception is thrown.

BufferedReader reader = null;
try {
    reader = new BufferedReader(new UnicodeReader(new FileInputStream(fileName), "UTF-8"));
    // ...
} finally {
    if (reader != null) try { reader.close(); } catch (IOException logOrIgnore) {}
}

Also unrelated, I'd suggest moving Pattern p = Pattern.compile(","); outside the loop, or even making it a static constant, because compiling it is relatively expensive and there is no need to do it on every iteration.
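
As a rough sketch of the static-constant variant (ArticleLineParser, COMMA and splitFields are made-up names for illustration, not from the question):

```java
import java.util.regex.Pattern;

// Hypothetical holder class; only the placement of the Pattern matters here.
final class ArticleLineParser {
    // Compiled once when the class is loaded, then reused for every line.
    private static final Pattern COMMA = Pattern.compile(",");

    /** Splits one line of the file into its comma-separated fields. */
    static String[] splitFields(String line) {
        return COMMA.split(line);
    }
}
```

Inside the read loop the call then becomes String[] article = ArticleLineParser.splitFields(line);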

Community
BalusC
  • It isn't because my SQLite INSERT also doesn't work. If I manually type in (on my HTC Desire) as input "žđšćč" and forward it to INSERT statement - it works fine. If I, however, use read function to read those same characters from my .txt file - crash. So, it's not just console output. Something else? – svenkapudija Jan 04 '11 at 20:19
  • I updated the code above...I really don't know what the heck is going on =/ – svenkapudija Jan 04 '11 at 20:57
  • Does it now look right on output console or not? Another cause can be that your SQLite JDBC driver and/or DB is not treating the characters as UTF-8. – BalusC Jan 04 '11 at 21:17
  • Here is a partial answer - http://stackoverflow.com/questions/4599061/unable-to-parse-as-integer. The document (.txt file) IS UTF-8, so why isn't my reader reading it as UTF-8? – svenkapudija Jan 04 '11 at 23:58
2

Your code looks right, but a very common and easy error is to mistake what is printed on screen for what's in the String. Check with a debugger whether the string has already been read correctly.
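
If a debugger is not at hand, one way to see what is really in the String is to print its code points instead of the text itself. This is only a sketch; StringInspector and describe are hypothetical names.

```java
// Hypothetical debugging helper: shows every code point of a String,
// including invisible ones that a console or a Toast would not render.
final class StringInspector {
    /** Returns the code points of s as space-separated U+XXXX tokens. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }
}
```

For example, describe("\uFEFF66") returns "U+FEFF U+0036 U+0036", which makes a leading invisible character obvious even when the printed text looks correct.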

Magnus
1

Notepad does not save special symbols correctly. I had a similar problem, so I used Notepad++ instead and selected UTF-8 encoding there. After that, my program no longer crashed when applying String library methods to the text, unlike when I created the text file in Notepad.

user929404
0

Notepad may not be able to handle non-ASCII characters. Try another text editor. If you want to stick to what's available in a standard Windows install, try WordPad.

Konstantin Komissarchik
0
"Not all sequences of bytes are valid UTF-8."

See

http://en.wikipedia.org/wiki/UTF-8

under "Invalid byte sequences" for specific details.
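
To check whether a file's bytes really are valid UTF-8, a strict decoder can be used. InputStreamReader silently replaces malformed input with U+FFFD, so it won't tell you. The following is a sketch under the assumption that the whole file fits in memory; Utf8Validator and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration.
final class Utf8Validator {
    /** Returns true if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A file saved as "ANSI" typically fails this check as soon as it contains a non-ASCII character, because a single-byte code page writes such characters as lone bytes of 0x80 and above, which are not valid UTF-8 on their own.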

xagyg
0

Are you doing the character conversion as part of a servlet request/response? If yes, request.setCharacterEncoding("UTF-8")
or
response.setCharacterEncoding("UTF-8")

should solve your problem.