0

I am trying to download the contents of an FTP folder. One of the files is an XML file that starts with the standard XML declaration:

< ?xml version="1.0" encoding="utf-8"?>

When I read these files (using java.net.Socket), get the input stream, and then convert it to a String, some extra characters appear, and the whole XML document starts with '?', e.g. "?< ?xml version="1.0" encoding="utf-8"?>....."

BufferedInputStream reader = new BufferedInputStream(sock.getInputStream());

Then I build a String from this stream using the following code:

StringBuilder sb = new StringBuilder();
String line;
BufferedReader br = new BufferedReader(new InputStreamReader(reader));

while ((line = br.readLine()) != null) {
    sb.append(line);
}
// Print the accumulated text, not the literal string "sb.toString()"
System.out.println(sb.toString());

I am not sure what is happening here. Why are these special characters being introduced? Any suggestions would be appreciated.

I then used the following code to read the downloaded file, and I also see special characters in the console:

BufferedReader reader = new BufferedReader(new FileReader("c:/Users/appd922/DocumentMeta06122014.xml"));
StringBuffer sb = new StringBuffer();
String line = null;
while ((line = reader.readLine()) != null) {
    sb.append(line);
}

String output = sb.toString();
System.out.println("reading from file"+output);

The output starts with "reading from fileï»¿< ?xml version.....

Where are these special characters coming from?

Note: ignore the space in the XML declaration given above. I could not post proper XML here without that space.

Leo
surya
  • If the file had been correctly read using the UTF-8 encoding, those first three bytes would be read as a single [byte order mark](http://en.wikipedia.org/wiki/Byte_order_mark) character, '\ufeff', a special Unicode character specifically intended to be the first character in a text document, where software can use it to determine the overall encoding of the document (including the byte order, if UTF-16 or UTF-32 is used). In your case, it is showing up as the three characters "ï»¿" because your use of FileReader uses your system's default charset, which assumes each byte is one character. – VGR Jun 17 '14 at 01:30
  • Is it appropriate to include the BOM in an xml document with the prolog present? – Brett Okken Jun 17 '14 at 01:36
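
To illustrate the comment above, here is a minimal sketch of how the same three bytes decode under two charsets. The class name `BomDecodeDemo` is made up for illustration, and ISO-8859-1 stands in for the asker's platform default, which is assumed (not confirmed) to be a single-byte charset such as windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question.
class BomDecodeDemo {
    // The three bytes that encode U+FEFF (the byte order mark) in UTF-8.
    static final byte[] BOM_BYTES = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Decoded as UTF-8, the three bytes form one character, U+FEFF.
        String asUtf8 = decode(BOM_BYTES, StandardCharsets.UTF_8);
        // Decoded with a single-byte charset, each byte becomes its own character.
        String asSingleByte = decode(BOM_BYTES, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 length: " + asUtf8.length());
        System.out.println("ISO-8859-1 length: " + asSingleByte.length());
    }
}
```

Java's UTF-8 decoder keeps the BOM in the decoded text instead of dropping it, so the character still has to be handled after decoding.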

2 Answers

2

Specify the encoding when creating InputStreamReader to read the file from the ftp, for example:

BufferedReader br = new BufferedReader(new InputStreamReader(reader, "utf-8"));

Otherwise, InputStreamReader uses the platform default encoding. Also specify the encoding when reading the downloaded file: FileReader uses the platform default encoding as well. Use InputStreamReader with an explicit encoding instead, for example:

BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), "utf-8"));
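
Putting the two snippets together, a self-contained version might look like the sketch below. The class and method names (`Utf8TextReader`, `readAll`) are hypothetical, and the byte array in `main` only stands in for the socket or file stream from the question. Unlike the question's loop, this sketch appends a newline after each line so that line breaks are not lost.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class Utf8TextReader {
    // Reads the whole stream as UTF-8 text, independent of the platform default charset.
    static String readAll(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n'); // readLine() strips the line terminator
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample bytes standing in for sock.getInputStream() or a FileInputStream.
        byte[] sample = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<doc>\u00E9</doc>\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(sample)));
    }
}
```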
tenorsax
1

Those characters are the BOM (byte order mark). If you set the encoding of the InputStreamReader to UTF-8, you will see that they are interpreted as a single character, the BOM character.

Unfortunately, you have to handle this character yourself, because Java won't do it for you (see "java utf-8 and bom"). Usually you just strip it from your stream. Good luck.
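
One way to do the stripping, as a minimal sketch: decode the bytes as UTF-8 first, then drop a leading U+FEFF if there is one. The class and method names (`BomStripper`, `stripBom`) are invented for illustration, and this is only one possible approach.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
class BomStripper {
    static final char BOM = '\uFEFF';

    // Returns the text without a leading byte order mark, if one is present.
    static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // UTF-8 BOM bytes followed by an XML declaration, as in the question.
        byte[] decl = "<?xml version=\"1.0\" encoding=\"utf-8\"?>".getBytes(StandardCharsets.UTF_8);
        byte[] raw = new byte[decl.length + 3];
        raw[0] = (byte) 0xEF;
        raw[1] = (byte) 0xBB;
        raw[2] = (byte) 0xBF;
        System.arraycopy(decl, 0, raw, 3, decl.length);

        String decoded = new String(raw, StandardCharsets.UTF_8);
        System.out.println(stripBom(decoded).startsWith("<?xml"));
    }
}
```

The check only makes sense after decoding with the correct charset. If the bytes were decoded with a single-byte default charset, the BOM appears as three separate characters and this comparison will not match.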

xiaofeng.li
  • Yes, I had to handle the BOM in code. I removed this character, "\uFEFF", and it worked. Thanks – surya Jun 19 '14 at 20:01