Read and write a file that contains UTF-8 characters (different language)

Question

I have a file which has characters like: " Joh 1:1 ஆதியிலே வார்த்தை இருந்தது, அந்த வார்த்தை தேவனிடத்திலிருந்தது, அந்த வார்த்தை தேவனாயிருந்தது. "

www.unicode.org/charts/PDF/U0B80.pdf

When I use the following code:

bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, "UTF8"));

The output is boxes and other weird characters like this:

"�P�^��O֛��;�<�aYՠ؛"

Can anyone help?

This is the complete code:

File f = new File("E:\\bible.docx");
Reader decoded = new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8);
bufferedWriter = new BufferedWriter(new OutputStreamWriter(System.out, StandardCharsets.UTF_8));
char[] buffer = new char[1024];
int n;
StringBuilder build = new StringBuilder();
while (true) {
    n = decoded.read(buffer);
    if (n < 0) { break; }
    build.append(buffer, 0, n);
    bufferedWriter.write(buffer);
}


The StringBuilder value shows the expected Unicode characters, but when they are displayed in the output window they appear as boxes.

Found the answer to the problem. The encoding is correct (i.e. UTF-8): Java reads the file as UTF-8 and the String holds the expected characters. The problem is that the font used by the NetBeans output panel has no glyphs for them. After changing the font for the output panel (NetBeans -> Tools -> Options -> Miscellaneous -> Output tab) I got the expected result. The same applies when the text is displayed in a JTextArea (the font needs to be changed). But we can't change the font of the Windows cmd prompt.

How do you read the file? do you have the code you use for reading? — morgano, Aug 01 '13 at 04:00
You're providing the charset name as a string literal. The name, according to the documentation, is "UTF-8". — Zec, Aug 01 '13 at 04:07
Verify in a debugger that the strings contain the Unicode characters you expect. Then verify that the output device you use, support UTF8. — Thorbjørn Ravn Andersen, Aug 01 '13 at 04:19
To read a `docx` file, you need a `docx` reader. You cannot read it as if it were plain text. The problem is not the language, it is the file format. — Peter Lawrey, Aug 01 '13 at 06:05
Found the Answer to the problem; The Encoding is Correct (i.e UTF-8) — Alfa, Aug 06 '13 at 09:30
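
Peter Lawrey's point about the `docx` format can be illustrated with a minimal sketch. It assumes the file is a standard Office Open XML package, i.e. a ZIP archive whose main text lives in the entry `word/document.xml`, and that this entry is UTF-8 encoded (the XML prolog states the actual encoding). The class name `DocxXmlReader` is hypothetical, and `readAllBytes` needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: pulls the raw XML of the main document part out of a .docx.
final class DocxXmlReader {

    // Returns the contents of word/document.xml decoded as UTF-8 (assumed encoding).
    static String readDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the path to a .docx file as the first argument.
        String xml = readDocumentXml(Paths.get(args[0]));
        System.out.println(xml.length() + " characters of XML read");
    }
}
```

This returns the XML markup, not the plain text of the verses. Extracting just the text would need an XML parser or a dedicated library such as Apache POI.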

erickson · Accepted Answer · 2013-08-01T04:18:10.380

5

Because your output is encoded in UTF-8, but still contains the replacement character (U+FFFD, �), I believe the problem occurs when you read the data.
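
As a rough illustration of where U+FFFD comes from, the sketch below decodes two byte arrays as UTF-8: one that really is UTF-8 encoded Tamil, and one with made-up bytes that are not valid UTF-8. The class name `ReplacementCharDemo` and the sample bytes are assumptions for illustration only; the Tamil text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows how malformed input turns into U+FFFD when decoded as UTF-8.
final class ReplacementCharDemo {

    // new String(bytes, charset) substitutes the replacement character for malformed input.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // True if the decoded text contains the replacement character U+FFFD.
    static boolean hasReplacementChar(String text) {
        return text.indexOf('\uFFFD') >= 0;
    }

    public static void main(String[] args) {
        // Three Tamil code points (U+0B86, U+0BA4, U+0BBF) encoded as UTF-8.
        byte[] tamil = "\u0B86\u0BA4\u0BBF".getBytes(StandardCharsets.UTF_8);
        // Made-up bytes: 0xFF and a lone 0x80 can never appear in valid UTF-8.
        byte[] binary = { 0x4A, (byte) 0xFF, (byte) 0x80 };

        System.out.println("valid input damaged:  " + hasReplacementChar(decodeUtf8(tamil)));
        System.out.println("binary input damaged: " + hasReplacementChar(decodeUtf8(binary)));
    }
}
```

If the decoded string already contains U+FFFD, the damage happened while reading, and no output encoding or font can repair it.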

Make sure that you know which encoding your input stream uses, and set the encoding for the InputStreamReader accordingly. If the text is Tamil, I would guess it's probably in UTF-8. I don't know if Java supports TACE-16. It would look something like this…

// Collects the decoded characters; named differently from the char[] buffer below.
StringBuilder text = new StringBuilder();
try (InputStream encoded = ...) {
  Reader decoded = new InputStreamReader(encoded, StandardCharsets.UTF_8);
  char[] buffer = new char[1024];
  while (true) {
    int n = decoded.read(buffer);
    if (n < 0)
      break;
    // Append only the n characters that were actually read.
    text.append(buffer, 0, n);
  }
}
String verse = text.toString();

edited Aug 01 '13 at 04:18

answered Aug 01 '13 at 04:09

erickson


@Zec If you mean UTF8 instead of UTF-8, no. UTF8 is an alias for the UTF-8 encoding. If the encoding isn't found, most APIs will throw an `UnsupportedEncodingException` – erickson Aug 01 '13 at 04:19
Got it. Thanks. I have no business answering Java questions anyway. – Zec Aug 01 '13 at 04:23
File f=new File("E:\\bible.docx"); Reader decoded=new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8); bufferedWriter = new BufferedWriter (new OutputStreamWriter(System.out, StandardCharsets.UTF_8)); char[] buffer = new char[1024]; int n; StringBuilder build=new StringBuilder(); while(true){ n=decoded.read(buffer); if(n<0){break;} build.append(buffer,0,n); bufferedWriter.write(buffer); } – Alfa Aug 01 '13 at 05:29
@Alfa The easiest way to see if the input decoding is correct is to look at the decoded characters in memory with a debugger. If you aren't familiar with your debugger, you could print the numeric value of some of the characters. They should be in the range 0x0B80-0x0BFF – erickson Aug 01 '13 at 05:35
Also, are you sure the input is UTF-8 encoded? That was a guess on my part. I'm not familiar with the encodings used for Tamil. Is the document actually Microsoft Word's XML format? If so, what encoding is specified in the XML? – erickson Aug 01 '13 at 05:38
The char array has the UTF characters in it. – Alfa Aug 01 '13 at 07:31
I can copy exactly from the input file to the output file. But I couldn't display the characters on the system stream (System.out), either in NetBeans or in the Command Prompt. I don't know why. – Alfa Aug 01 '13 at 12:50
If that's the case, then it was probably just your console settings. – erickson Aug 01 '13 at 17:33
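
The numeric check suggested in the comments above could look roughly like this. It is a sketch that assumes the decoded text is available as a `String`; the class and method names are hypothetical, and the range 0x0B80-0x0BFF is the Tamil block mentioned in the comment.

```java
import java.util.stream.Collectors;

// Hypothetical helper for inspecting decoded text without relying on a console font.
final class TamilBlockCheck {

    // True if the code point lies in the Tamil block U+0B80..U+0BFF.
    static boolean isTamil(int codePoint) {
        return codePoint >= 0x0B80 && codePoint <= 0x0BFF;
    }

    // Counts how many code points of the text fall in the Tamil block.
    static long countTamil(String text) {
        return text.codePoints().filter(TamilBlockCheck::isTamil).count();
    }

    // Renders each code point as U+XXXX so it can be printed on any console.
    static String hexDump(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String sample = "\u0B86\u0BA4\u0BBF"; // first three code points of the verse
        System.out.println(hexDump(sample));
        System.out.println(countTamil(sample) + " Tamil code points");
    }
}
```

Because the dump is plain ASCII, it shows whether decoding worked even when the console cannot render Tamil glyphs.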

score 1 · Answer 2 · answered Aug 01 '13 at 13:03

System.out sits too close to the operating system to be versatile. In your case, the NetBeans console is probably using the operating system's default encoding and a font picked by the IDE.

Write to a file first. If you make it HTML, you can specify the right encoding inside the file and even open it with a double click. Use "UTF-8" there, because "UTF8" is a Java-specific alias ("UTF-8" can be used in Java too). You could also open it from Java with Desktop.getDesktop().open(new File("... .html"));.

A small JFrame with a JTextPane would do too.
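
A minimal sketch of the write-to-HTML idea, under a few assumptions: the class name `HtmlVerseWriter` is hypothetical, the page goes to a temporary file, and the text contains no characters that need HTML escaping. Opening the page relies on `java.awt.Desktop`, which is not available in headless environments, so the call is guarded.

```java
import java.awt.Desktop;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes text into a small HTML page that declares UTF-8.
final class HtmlVerseWriter {

    // Writes the page as UTF-8 bytes and declares the same charset in the markup.
    // Assumes the text needs no HTML escaping.
    static void writeHtml(Path target, String text) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"></head>\n");
            out.write("<body><p>" + text + "</p></body></html>\n");
        }
    }

    public static void main(String[] args) throws IOException {
        Path page = Files.createTempFile("verse", ".html");
        writeHtml(page, "\u0B86\u0BA4\u0BBF");
        // Open in the default browser when a desktop is available; otherwise print the path.
        if (Desktop.isDesktopSupported() && Desktop.getDesktop().isSupported(Desktop.Action.OPEN)) {
            Desktop.getDesktop().open(page.toFile());
        } else {
            System.out.println("Wrote " + page);
        }
    }
}
```

The browser picks a font that has the required glyphs, which sidesteps both the console encoding and the console font.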

score 0 · Answer 3 · edited Nov 24 '15 at 15:52

It turns out that Tamil is encoded in 16 bits, so just use UTF-16 instead of UTF-8. By doing that I was able to print Tamil text in the Eclipse console.
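
For context, here is a small sketch comparing the two encodings for a single Tamil letter. It only illustrates byte sizes and round-tripping; which charset to pass to the reader still depends on how the file was actually saved. The class name `EncodingSizes` is hypothetical, and `UTF_16BE` is used so that no byte order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: byte length and round trip of Tamil text in UTF-8 and UTF-16.
final class EncodingSizes {

    // Number of bytes the text occupies in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // True if encoding and then decoding with the same charset returns the original text.
    static boolean roundTrips(String text, Charset charset) {
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        String letter = "\u0B86"; // TAMIL LETTER AA
        System.out.println("UTF-8 bytes:    " + encodedLength(letter, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes: " + encodedLength(letter, StandardCharsets.UTF_16BE));
        System.out.println("UTF-8 round trip:    " + roundTrips(letter, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE round trip: " + roundTrips(letter, StandardCharsets.UTF_16BE));
    }
}
```

Both encodings can represent the Tamil block; what matters is that the charset used for reading matches the one used when the bytes were written.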

edited Nov 24 '15 at 15:52

E-Riz


answered Nov 24 '15 at 15:09

Mohammed Muzzamil

