Java encoding issue while reading stream

Question

I am trying to download the contents of an FTP folder. One of the XML files starts with the standard XML declaration:

< ?xml version="1.0" encoding="utf-8"?>

When I read these files (using java.net.Socket), get the input stream, and convert it to a String, some extra characters appear, and the whole XML document starts with '?', e.g. "?< ?xml version="1.0" encoding="utf-8"?>....."

BufferedInputStream reader = new BufferedInputStream(sock.getInputStream());

Then I build a String from this reader using the following code:

StringBuilder sb = new StringBuilder();

String line;
BufferedReader br = new BufferedReader(new InputStreamReader(reader));

while ((line = br.readLine()) != null) {
    sb.append(line);
}
System.out.println(sb.toString());

I'm not sure what is happening here. Why are special characters being introduced? Any suggestions would be appreciated.

Then I used the following code to read the file, and in the console I see some special characters:

BufferedReader reader = new BufferedReader(new FileReader("c:/Users/appd922/DocumentMeta06122014.xml"));
StringBuffer sb = new StringBuffer();
String line = null;
while ((line = reader.readLine()) != null) {
    sb.append(line);
}

String output = sb.toString();
System.out.println("reading from file"+output);

The output starts with "reading from fileï»¿< ?xml version....."

Where are these special characters coming from?

Note: ignore the space in the XML declaration shown above; I could not post it here as proper XML without that space.

If the file had been correctly read using the UTF-8 encoding, those first three bytes would be read as a single [byte order mark](http://en.wikipedia.org/wiki/Byte_order_mark) character, '\ufeff', a special Unicode character specifically intended to be the first character in a text document, where software can use it to determine the overall encoding of the document (including the byte order, if UTF-16 or UTF-32 is used). In your case, it is showing up as the three characters "ï»¿" because your use of FileReader uses your system's default charset, which assumes each byte is one character. — VGR, Jun 17 '14 at 01:30
Is it appropriate to include the BOM in an xml document with the prolog present? — Brett Okken, Jun 17 '14 at 01:36
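
The decoding difference described in VGR's comment can be reproduced with a small sketch. It decodes the three UTF-8 BOM bytes twice: once as UTF-8 and once as ISO-8859-1. ISO-8859-1 is only a stand-in here for a single-byte platform default charset (the asker's actual default is not stated in the thread), and the class and method names are made up for illustration. Code points are printed instead of the characters themselves so the result does not depend on the console encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDecodingDemo {
    // U+FEFF (the byte order mark) is encoded in UTF-8 as the bytes EF BB BF.
    static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Decode raw bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Render each char as U+XXXX so the output is console-independent.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decoded as UTF-8: one character, the BOM itself.
        System.out.println(codePoints(decode(UTF8_BOM, StandardCharsets.UTF_8)));
        // prints: U+FEFF

        // Decoded with a single-byte charset (assumed stand-in for the
        // platform default): three separate characters, shown as "ï»¿".
        System.out.println(codePoints(decode(UTF8_BOM, StandardCharsets.ISO_8859_1)));
        // prints: U+00EF U+00BB U+00BF
    }
}
```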

tenorsax · Answer 1 · 2014-06-17T01:24:40.210

Specify the encoding when creating InputStreamReader to read the file from the ftp, for example:

BufferedReader br = new BufferedReader(new InputStreamReader(reader, "utf-8"));

Otherwise, InputStreamReader uses the platform default encoding. Also specify the encoding when reading the downloaded file: FileReader uses the platform default encoding, so use InputStreamReader with an explicit encoding instead, for example:

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "utf-8"));
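
As a slightly fuller sketch of the same idea, the helper below reads a whole stream as UTF-8. The class and method names are hypothetical, and a ByteArrayInputStream stands in for the socket or file stream. It uses StandardCharsets.UTF_8 rather than the string "utf-8", which avoids the checked UnsupportedEncodingException, and it re-appends a newline after each line because readLine() strips line terminators.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8StreamReading {
    // Read the entire stream as UTF-8 text, regardless of the platform default.
    static String readAllUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() drops the terminator, so put one back.
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for sock.getInputStream() or a FileInputStream.
        byte[] data = "<name>caf\u00e9</name>".getBytes(StandardCharsets.UTF_8);
        String text = readAllUtf8(new ByteArrayInputStream(data));
        // The two-byte UTF-8 sequence for U+00E9 is decoded as one character.
        System.out.println(text.length());
    }
}
```

Note that this only fixes the decoding: if the data starts with a BOM, the returned string still begins with the single character '\uFEFF'.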

score 1 · Accepted Answer · edited May 23 '17 at 12:31

1

Those characters are the BOM (byte order mark). If you set the encoding of the InputStreamReader to "UTF-8", you will see that they are decoded as a single character, the BOM character.

Unfortunately, you have to handle this character yourself, because Java won't do it for you (see the related question "java utf-8 and bom"). Usually you just strip it from the stream. Good luck.
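
One way to do that stripping is sketched below, assuming the data has already been decoded as UTF-8 so that the BOM arrives as the single character '\uFEFF'. The class and both helper names are hypothetical: one variant works on a String that has already been read, the other wraps a Reader and consumes a leading BOM before any parser sees it.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class BomStripping {
    static final char BOM = '\uFEFF';

    // Remove a single leading BOM from text that is already in memory.
    static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    // Wrap a reader so that a leading BOM, if present, is skipped.
    static Reader skipBom(Reader reader) throws IOException {
        PushbackReader pushback = new PushbackReader(reader, 1);
        int first = pushback.read();
        if (first != -1 && first != BOM) {
            // Not a BOM: put the character back for the caller.
            pushback.unread(first);
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        String xml = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?>";
        System.out.println(stripLeadingBom(xml).startsWith("<?xml")); // true

        Reader r = skipBom(new StringReader(xml));
        System.out.println((char) r.read()); // <
    }
}
```

A BOM anywhere other than the very start of the text is left untouched by both helpers.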

xiaofeng.li · answered Jun 17 '14 at 01:41 · edited May 23 '17 at 12:31

Yes, I had to handle the BOM in code. I removed this character, "\uFEFF", and it worked. Thanks – surya Jun 19 '14 at 20:01
