3

I'm trying to parse an XML document, but the parser throws an error. If I print the String with System.out.println, I can see the problem.

before

<?xml version="1.0" 

after

?<?xml version="1.0"

I tried changing the charset to UTF-8, but it didn't work. What should I do?

Diego Macario
  • Is it a [BOM](http://en.wikipedia.org/wiki/Byte_order_mark)? – Sotirios Delimanolis Nov 23 '13 at 21:05
  • From my Google search it seems to be. The whole `String` in my code is an invoice that I want to parse, but `SAX` throws an exception. – Diego Macario Nov 23 '13 at 21:07
  • More specifically, it's a BOM that has been decoded using the wrong encoding. If the file is read as UTF-8, then the BOM is interpreted as a single zero-width space character, or removed entirely by the software reading the file. If you read the file using an 8-bit encoding, you get three unusual characters as in the first example. – Guffa Nov 23 '13 at 21:12
  • So what should I do? Notepad++ shows the file as UTF-8. – Diego Macario Nov 23 '13 at 21:21
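
To make the comment about the wrong encoding concrete, here is a minimal sketch that decodes the same bytes twice. The class name `BomDecodingDemo` and the sample bytes are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the three UTF-8 BOM bytes look
// under two different decodings.
final class BomDecodingDemo {
    // EF BB BF is the UTF-8 byte order mark, followed by the start of an XML prolog.
    static final byte[] BOM_THEN_XML = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'
    };

    // An 8-bit encoding maps each BOM byte to its own visible character.
    static String asLatin1() {
        return new String(BOM_THEN_XML, StandardCharsets.ISO_8859_1);
    }

    // UTF-8 decoding in Java turns the BOM into the single character U+FEFF
    // and does not remove it.
    static String asUtf8() {
        return new String(BOM_THEN_XML, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 length: " + asLatin1().length());
        System.out.println("UTF-8 length: " + asUtf8().length());
        System.out.println("UTF-8 first char is U+FEFF: " + (asUtf8().charAt(0) == '\uFEFF'));
    }
}
```

Either way the text no longer starts with `<`, which is likely what a strict parser objects to.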

3 Answers

4

You have a UTF-8 string (which is why Notepad++ recognizes it as such), but UTF-8 doesn't require a BOM. Some programs produce one; some don't. This leads to occasional confusion when reading files: some readers (like the one you're using in your Java code) don't recognize the BOM and therefore don't skip it. I'd recommend something like the accepted answer to this question or this one for removing it. Make sure you check that the first 3 bytes actually are a BOM before removing them from every incoming string.
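
The check described above could look roughly like this. It is a minimal sketch, not the code from the linked answers: the class name `BomStripper` and its method names are invented for illustration, and it assumes you have the raw bytes of the document rather than an already decoded String.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: removes a UTF-8 BOM only when one is really there.
final class BomStripper {
    // The UTF-8 encoding of U+FEFF, the byte order mark.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns true only if the data starts with the three UTF-8 BOM bytes. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    /** Decodes the bytes as UTF-8, skipping a leading BOM if one is present. */
    static String decodeWithoutBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? UTF8_BOM.length : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(decodeWithoutBom(withBom));
    }
}
```

Input without a BOM passes through unchanged, so the helper can be applied to every incoming document.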

Josh
  • I read them, and I think I need to check whether the file begins with a BOM and, if it does, remove it. – Diego Macario Nov 23 '13 at 22:35
  • If you're parsing the XML with an XML library, it should tolerate the BOM (it's in the rules of XML that you have to accept a BOM in UTF-8). It's not impossible that somewhere along the line of producing the file you actually ended up with two BOMs. – Jon Hanna Nov 23 '13 at 23:48
  • The "rules of XML" have only said this since the 3rd edition, and some XML parsers are older than that. – Michael Kay Nov 24 '13 at 13:02
  • @user2283439 Yep; that's what I suggested. Your XML parser might not be doing this for you, so this should be your first step (unless you're somehow reading the file in Java as ASCII/Latin-1/CP-1252 text before you feed it to your parser; that would cause this problem too). – Josh Nov 24 '13 at 18:23
2

For anyone who wants to parse XML and runs into problems because of a BOM, the code below worked for me.

You can use `BOMInputStream` from Apache Commons IO, which does the job for you. I had this problem, and using this API is much easier. A tip when parsing XML: read the input as an array of bytes, pass it through `BOMInputStream`, and only then decode it to a String as UTF-8; that way you will not lose accented characters.

Code to turn the source into an InputStream and strip the BOM:

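// Encode explicitly as UTF-8 so the bytes match the charset used when decoding in takeOffBOM.
String source = FileUtil.takeOffBOM(IOUtils.toInputStream(attachment.getValue(), "UTF-8"));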

Method to remove the BOM:

public static String takeOffBOM(InputStream inputStream) throws IOException {
    BOMInputStream bomInputStream = new BOMInputStream(inputStream);
    return IOUtils.toString(bomInputStream, "UTF-8");
}
Diego Macario
1

A lot of utilities produce such odd leading characters.

You can use Java code to skip any characters before the first "<". If the XML file is your own, you can fix it for good with, for example:

vi # no filename here, we need first to get in binary mode.
:set binary
:e filename.containing.your.xml
dt<:w
:q!
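
For the Java route mentioned at the start of this answer, a minimal sketch could look like the following. The class and method names are hypothetical, and it assumes the document is already in a String. Note that it drops everything before the first "<", not only a BOM.

```java
// Hypothetical helper: discards whatever precedes the first '<' in the text.
final class LeadingJunkSkipper {
    static String skipToFirstTag(String xml) {
        int start = xml.indexOf('<');
        // Leave the input untouched when it already starts with '<'
        // or contains no '<' at all.
        return start > 0 ? xml.substring(start) : xml;
    }

    public static void main(String[] args) {
        String dirty = "\uFEFF<?xml version=\"1.0\"?><invoice/>";
        System.out.println(skipToFirstTag(dirty));
    }
}
```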
user2987828